Chapter 9 (Classical Statistical Inference): Classical Parameter Estimation

These are reading notes for *Introduction to Probability*.

Distributions of Commonly Used Statistics

  • (1) Standard normal distribution $N(0,1)$:
    $$f(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$$
  • (2) $\chi^2$ (chi-square) distribution: let $X_1,\dots,X_n$ be independent, each with the standard normal distribution $N(0,1)$. Then
    $$\chi^2=X_1^2+\dots+X_n^2$$
    follows the $\chi^2$ distribution with $n$ degrees of freedom, denoted $\chi^2(n)$, with PDF
    $$f(x)=\begin{cases}\dfrac{1}{2^{n/2}\,\Gamma(n/2)}\,x^{\frac{n}{2}-1}e^{-\frac{x}{2}}, & x>0,\\ 0, & \text{otherwise.}\end{cases}$$
    • Additivity of the $\chi^2$ distribution: if $X\sim\chi^2(n)$, $Y\sim\chi^2(m)$, and $X$ and $Y$ are independent, then
      $$X+Y\sim\chi^2(n+m)$$
    • If $X\sim\chi^2(n)$, then
      $$E[X]=n,\qquad \mathrm{var}(X)=2n$$
  • (3) $t$ distribution: let $X\sim N(0,1)$ and $Y\sim\chi^2(n)$, with $X$ and $Y$ independent. Then
    $$t=\frac{X}{\sqrt{Y/n}}$$
    follows the $t$ distribution with $n$ degrees of freedom, with PDF
    $$f(x)=\frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)}\left(1+\frac{x^2}{n}\right)^{-\frac{n+1}{2}}$$
    • It can be shown that $\lim_{n\to\infty}f(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$, so for sufficiently large $n$ (say $n\geq 45$) the $t$ distribution with $n$ degrees of freedom can be approximated by the standard normal distribution.
  • (4) $F$ distribution: if $X\sim\chi^2(n)$, $Y\sim\chi^2(m)$, and $X$ and $Y$ are independent, then
    $$F=\frac{X/n}{Y/m}$$
    follows the $F$ distribution with degrees of freedom $(n,m)$, denoted $F(n,m)$.
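A quick simulation makes these constructions concrete. The sketch below (a minimal Python/NumPy/SciPy example; the sample size and degrees of freedom are arbitrary illustrative choices, not from the text) builds $\chi^2$, $t$, and $F$ samples directly from standard normal draws and compares them with the corresponding scipy distributions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 200_000          # number of simulated values (illustrative)
n, m = 5, 8          # degrees of freedom (illustrative)

# chi^2(n): sum of n squared independent N(0,1) variables
chi2 = (rng.standard_normal((N, n)) ** 2).sum(axis=1)
print(chi2.mean(), chi2.var())          # ~ n and ~ 2n

# t(n): N(0,1) divided by sqrt(chi^2(n)/n), with numerator and denominator independent
t = rng.standard_normal(N) / np.sqrt((rng.standard_normal((N, n)) ** 2).sum(axis=1) / n)

# F(n, m): ratio of two independent chi^2 variables, each divided by its degrees of freedom
f = ((rng.standard_normal((N, n)) ** 2).sum(axis=1) / n) / \
    ((rng.standard_normal((N, m)) ** 2).sum(axis=1) / m)

# compare empirical tail fractions with the theoretical 0.9 quantile of each distribution
for name, sample, dist in [("chi2", chi2, stats.chi2(n)),
                           ("t",    t,    stats.t(n)),
                           ("F",    f,    stats.f(n, m))]:
    print(name, (sample <= dist.ppf(0.9)).mean(), "~ 0.9")
```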

Sampling Distributions for a Normal Population

  • Let the population $X\sim N(\mu,\sigma^2)$, and let $X_1,\dots,X_n$ be a simple random sample from $X$, with sample mean $\bar X$ and sample variance $S^2$. Then
    $$\frac{\bar X-\mu}{\sigma/\sqrt n}\sim N(0,1),\qquad \frac{(n-1)S^2}{\sigma^2}\sim \chi^2(n-1)\ \text{(and $\bar X$ and $S^2$ are independent)},\qquad \frac{\bar X-\mu}{S/\sqrt n}\sim t(n-1)$$

The proof of the second result is beyond the scope of these notes.
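Although the proof is omitted, all three facts can be checked empirically. A minimal simulation sketch (illustrative parameter values, not from the text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n, trials = 2.0, 3.0, 10, 100_000    # illustrative values

x = rng.normal(mu, sigma, size=(trials, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)                      # sample variance S^2 (divides by n-1)

z = (xbar - mu) / (sigma / np.sqrt(n))          # should be N(0, 1)
q = (n - 1) * s2 / sigma**2                     # should be chi^2(n-1)
t = (xbar - mu) / (np.sqrt(s2) / np.sqrt(n))    # should be t(n-1)

print(z.mean(), z.var())                                # ~ 0 and ~ 1
print(q.mean(), q.var())                                # ~ n-1 and ~ 2(n-1)
print((np.abs(t) <= stats.t(n - 1).ppf(0.975)).mean())  # ~ 0.95
print(np.corrcoef(xbar, s2)[0, 1])                      # ~ 0, consistent with independence of Xbar and S^2
```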

Classical Statistical Inference

  • In the preceding chapter, we developed the Bayesian approach to inference, where unknown parameters are modeled as random variables. In all cases we worked within a single, fully-specified probabilistic model, and we based most of our derivations and calculations on judicious application of Bayes’ rule.

  • By contrast, in the present chapter we adopt a fundamentally different philosophy: we view the unknown parameter $\theta$ as a deterministic (not random) but unknown quantity. The observation $X$ is random, and its distribution $p_X(x;\theta)$ [if $X$ is discrete] or $f_X(x;\theta)$ [if $X$ is continuous] depends on the value of $\theta$.
  • Thus, instead of working within a single probabilistic model, we will be dealing simultaneously with multiple candidate models, one model for each possible value of $\theta$.
  • In this context, a “good” hypothesis testing or estimation procedure will be one that possesses certain desirable properties under every candidate model, that is, for every possible value of $\theta$. In some cases, this may be considered a worst-case viewpoint: a procedure is not considered to fulfill our specifications unless it does so even for the worst possible value that $\theta$ can take (that is, only a procedure that still meets the requirements in the worst case is considered good).
    • For example, we may require that the expected value of the estimation error be zero, or that the estimation error be small with high probability, for all possible values of the unknown parameter.

  • Our notation will generally indicate the dependence of probabilities and expected values on $\theta$.
    • For example, we will denote by $E_\theta[h(X)]$ the expected value of a random variable $h(X)$ as a function of $\theta$. Similarly, we will use the notation $P_\theta(A)$ to denote the probability of an event $A$.
    • Note that this only indicates a functional dependence, not conditioning in the probabilistic sense.

Classical Parameter Estimation

Properties of Estimators

  • Given observations $X = (X_1,\dots,X_n)$, an estimator is a random variable of the form $\hat\Theta = g(X)$, for some function $g$.
  • Note that since the distribution of $X$ depends on $\theta$, the same is true for the distribution of $\hat\Theta$. We use the term estimate to refer to an actual realized value of $\hat\Theta$.

  • Sometimes, particularly when we are interested in the role of the number of observations $n$, we use the notation $\hat\Theta_n$ for an estimator. It is then also appropriate to view $\hat\Theta_n$ as a sequence of estimators (one for each value of $n$). The mean and variance of $\hat\Theta_n$ are denoted $E_\theta[\hat\Theta_n]$ and $\mathrm{var}_\theta(\hat\Theta_n)$, respectively. Both are numerical functions of $\theta$, but for simplicity, when the context is clear, we sometimes do not show this dependence.

Terminology Regarding Estimators

Let $\hat\Theta_n$ be an estimator of an unknown parameter $\theta$, that is, a function of $n$ observations $X_1,\dots,X_n$ whose distribution depends on $\theta$.

  • The estimation error, denoted by $\tilde\Theta_n$, is defined by $\tilde\Theta_n=\hat\Theta_n-\theta$.
  • The bias of the estimator, denoted by $b_\theta(\hat\Theta_n)$, is the expected value of the estimation error:
    $$b_\theta(\hat\Theta_n)=E_\theta[\hat\Theta_n]-\theta$$
  • The expected value, the variance, and the bias of $\hat\Theta_n$ depend on $\theta$, while the estimation error depends in addition on the observations $X_1,\dots,X_n$.
  • We call $\hat\Theta_n$ unbiased if $E_\theta[\hat\Theta_n]=\theta$ for every possible value of $\theta$.
  • We call $\hat\Theta_n$ asymptotically unbiased if $\lim_{n\to\infty}E_\theta[\hat\Theta_n]=\theta$ for every possible value of $\theta$.
  • We call $\hat\Theta_n$ consistent if the sequence $\hat\Theta_n$ converges to the true value of the parameter $\theta$, in probability, for every possible value of $\theta$.

  • Besides the bias $b_\theta(\hat\Theta_n)$, we are usually interested in the size of the estimation error. This is captured by the mean squared error $E_\theta[\tilde\Theta_n^2]$, which is related to the bias and the variance of $\hat\Theta_n$ according to the following formula:
    $$E_\theta[\tilde\Theta_n^2]=b^2_\theta(\hat\Theta_n)+\mathrm{var}_\theta(\hat\Theta_n)$$
  • This formula is important because in many statistical problems there is a tradeoff between the two terms on the right-hand side: often a reduction in the variance is accompanied by an increase in the bias. Of course, a good estimator is one that manages to keep both terms small.
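    • To see where this identity comes from, add and subtract $E_\theta[\hat\Theta_n]$ inside the square; the cross term has zero expectation, so
      $$E_\theta[\tilde\Theta_n^2]=E_\theta\Big[\big(\hat\Theta_n-E_\theta[\hat\Theta_n]+E_\theta[\hat\Theta_n]-\theta\big)^2\Big]=\mathrm{var}_\theta(\hat\Theta_n)+\big(E_\theta[\hat\Theta_n]-\theta\big)^2=\mathrm{var}_\theta(\hat\Theta_n)+b_\theta^2(\hat\Theta_n)$$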

Maximum Likelihood Estimation

This is a general method that bears similarity to MAP estimation.

  • Let the vector of observations $X = (X_1,\dots,X_n)$ be described by a joint PMF $p_X(x;\theta)$ whose form depends on an unknown (scalar or vector) parameter $\theta$. Suppose we observe a particular value $x = (x_1,\dots,x_n)$ of $X$. Then, a maximum likelihood (ML) estimate is a value of the parameter that maximizes the numerical function $p_X(x_1,\dots,x_n;\theta)$ over all $\theta$:
    $$\hat\theta_n=\arg\max_\theta p_X(x_1,\dots,x_n;\theta)$$
    For the case where $X$ is continuous,
    $$\hat\theta_n=\arg\max_\theta f_X(x_1,\dots,x_n;\theta)$$
  • We refer to $p_X(x;\theta)$ [or $f_X(x;\theta)$ if $X$ is continuous] as the likelihood function.

  • In many applications, the observations $X_i$ are assumed to be independent, in which case the likelihood function is of the form
    $$p_X(x_1,\dots,x_n;\theta)=\prod_{i=1}^n p_{X_i}(x_i;\theta)$$
    (for discrete $X_i$). In this case, it is often analytically or computationally convenient to maximize its logarithm, called the log-likelihood function,
    $$\log p_X(x_1,\dots,x_n;\theta)=\sum_{i=1}^n\log p_{X_i}(x_i;\theta),$$
    over $\theta$. When $X$ is continuous, there is a similar possibility, with PMFs replaced by PDFs: we maximize over $\theta$ the expression
    $$\log f_X(x_1,\dots,x_n;\theta)=\sum_{i=1}^n\log f_{X_i}(x_i;\theta)$$

  • Recall that in Bayesian MAP estimation, the estimate is chosen to maximize the expression $p_\Theta(\theta)p_{X|\Theta}(x\mid\theta)$ over all $\theta$, where $p_\Theta(\theta)$ is the prior PMF of an unknown discrete parameter $\theta$. Thus, if we view $p_X(x;\theta)$ as a conditional PMF, we may interpret ML estimation as MAP estimation with a flat prior, i.e., a prior which is the same for all $\theta$, indicating the absence of any useful prior knowledge.
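As a minimal numerical sketch of this recipe (using i.i.d. Bernoulli observations as an illustrative model, anticipating the Bernoulli example at the end of this section), the code below maximizes the log-likelihood over a grid of candidate $\theta$ values and confirms that the maximizer agrees with the closed-form ML estimate, which for Bernoulli data is the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = 0.3                              # unknown in practice; used here only to generate data
x = rng.binomial(1, theta_true, size=200)     # i.i.d. Bernoulli observations

def log_likelihood(theta, x):
    # sum over i of log p_{X_i}(x_i; theta) for Bernoulli(theta) observations
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

grid = np.linspace(0.001, 0.999, 999)
theta_ml = grid[np.argmax([log_likelihood(t, x) for t in grid])]

print(theta_ml, x.mean())                     # grid maximizer vs. closed-form ML estimate (sample mean)
```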

Example 9.1.

  • Let us revisit Example 8.2, in which Juliet is always late by an amount $X$ that is uniformly distributed over the interval $[0,\theta]$, and $\theta$ is an unknown parameter. In that example, we used a random variable $\Theta$ with flat prior PDF $f_\Theta(\theta)$ (uniform over the interval $[0,1]$) to model the parameter, and we showed that the MAP estimate is the value $x$ of $X$.
  • In the classical context of this section, there is no prior and $\theta$ is treated as a constant, but the ML estimate is again $\hat\theta=x$. The resulting estimator is $\hat\Theta=X$.

Example 9.4. Estimating the Mean and Variance of a Normal.

  • Consider the problem of estimating the mean $\mu$ and variance $v$ of a normal distribution using $n$ independent observations $X_1,\dots,X_n$. The parameter vector here is $\theta = (\mu, v)$. The corresponding likelihood function is
    $$f_X(x;\mu,v)=\prod_{i=1}^nf_{X_i}(x_i;\mu,v)=\prod_{i=1}^n\frac{1}{\sqrt{2\pi v}}e^{-(x_i-\mu)^2/2v}=\frac{1}{(2\pi v)^{n/2}}\prod_{i=1}^ne^{-(x_i-\mu)^2/2v}$$
    After some calculation it can be written as
    $$f_X(x;\mu,v)=\frac{1}{(2\pi v)^{n/2}}\cdot\exp\Big\{-\frac{ns_n^2}{2v}\Big\}\cdot\exp\Big\{-\frac{n(m_n-\mu)^2}{2v}\Big\},$$
    where $m_n$ is the realized value of the random variable
    $$M_n=\frac{1}{n}\sum_{i=1}^nX_i$$
    and $s_n^2$ is the realized value of the random variable
    $$\overline S_n^2=\frac{1}{n}\sum_{i=1}^n(X_i-M_n)^2$$
    • To verify this, write, for $i = 1,\dots,n$,
      $$(x_i-\mu)^2=(x_i-m_n+m_n-\mu)^2=(x_i-m_n)^2+(m_n-\mu)^2+2(x_i-m_n)(m_n-\mu),$$
      sum over $i$, and note that
      $$\sum_{i=1}^n(x_i-m_n)(m_n-\mu) = (m_n-\mu)\sum_{i=1}^n(x_i-m_n)= 0$$
  • The log-likelihood function is
    $$\log f_X(x;\mu,v)=-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(v)-\frac{ns_n^2}{2v}-\frac{n(m_n-\mu)^2}{2v}$$
    Setting the derivatives of this function with respect to $\mu$ and $v$ to zero, we obtain the estimate and estimator, respectively,
    $$\hat\theta_n=(m_n,s_n^2),\qquad \hat\Theta_n=(M_n,\overline S_n^2)$$
    Note that $M_n$ is the sample mean, while $\overline S_n^2$ may be viewed as a “sample variance.” As will be shown shortly, $E_\theta[\overline S_n^2]$ converges to $v$ as $n$ increases, so that $\overline S_n^2$ is asymptotically unbiased. Using also the weak law of large numbers, it can be shown that $M_n$ and $\overline S_n^2$ are consistent estimators of $\mu$ and $v$, respectively.
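The closed-form estimates can also be checked numerically. A sketch with illustrative data follows: a general-purpose optimizer applied to the negative log-likelihood should recover $(m_n, s_n^2)$, i.e., the sample mean and the biased sample variance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=500)            # illustrative data; theta = (mu, v) treated as unknown

def neg_log_likelihood(params, x):
    mu, v = params
    n = len(x)
    # negative of the log-likelihood derived above (constant term kept for completeness)
    return 0.5 * n * np.log(2 * np.pi * v) + np.sum((x - mu) ** 2) / (2 * v)

res = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(x,),
               bounds=[(None, None), (1e-9, None)])
print(res.x)                                  # numerically estimated (mu, v)
print(x.mean(), x.var(ddof=0))                # closed-form m_n and biased sample variance s_n^2
```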

  • Maximum likelihood estimation has some appealing properties.
    • For example, it obeys the invariance principle: if $\hat\Theta_n$ is the ML estimate of $\theta$, then for any one-to-one function $h$ of $\theta$, the ML estimate of the parameter $\zeta=h(\theta)$ is $h(\hat\Theta_n)$.
    • Also, when the observations are i.i.d. (independent, identically distributed), and under some mild additional assumptions, it can be shown that the ML estimator is consistent.
    • Another interesting property is that when $\theta$ is a scalar parameter, then under some mild conditions, the ML estimator has an asymptotic normality property. In particular, it can be shown that the distribution of $(\hat\Theta_n-\theta)/\sigma(\hat\Theta_n)$, where $\sigma^2(\hat\Theta_n)$ is the variance of $\hat\Theta_n$, approaches the standard normal distribution. Thus, if we are also able to estimate $\sigma(\hat\Theta_n)$, we can use it to derive an error variance estimate based on a normal approximation. When $\theta$ is a vector parameter, a similar statement applies to each of its components.

Estimation of the Mean and Variance of a Random Variable

  • Suppose that the observations $X_1,\dots,X_n$ are i.i.d., with an unknown common mean $\theta$. The most natural estimator of $\theta$ is the sample mean:
    $$M_n=\frac{X_1+\dots+X_n}{n}$$
    This estimator is unbiased. Its mean squared error is equal to its variance, which is $v/n$, where $v$ is the common variance of the $X_i$. Furthermore, by the weak law of large numbers, this estimator converges to $\theta$ in probability, and is therefore consistent.
  • Suppose that we are interested in an estimator of the variance $v$. A natural one is
    $$\overline S_n^2=\frac{1}{n}\sum_{i=1}^n(X_i-M_n)^2,$$
    which coincides with the ML estimator derived in Example 9.4 under a normality assumption. We have
    $$\begin{aligned}E_{(\theta,v)}[\overline S_n^2]&=\frac{1}{n}E_{(\theta,v)}\Big[\sum_{i=1}^nX_i^2-2M_n\sum_{i=1}^nX_i+nM_n^2\Big]\\&=E_{(\theta,v)}\Big[\frac{1}{n}\sum_{i=1}^nX_i^2-2M_n^2+M_n^2\Big]\\&=E_{(\theta,v)}\Big[\frac{1}{n}\sum_{i=1}^nX_i^2-M_n^2\Big]\\&=E_{(\theta,v)}[X_1^2]-E_{(\theta,v)}[M_n^2]\\&=(\theta^2+v)-\Big(\theta^2+\frac{v}{n}\Big)\\&=\frac{n-1}{n}v\end{aligned}$$
    Thus, $\overline S_n^2$ is not an unbiased estimator of $v$, although it is asymptotically unbiased. We can obtain an unbiased variance estimator after suitable scaling. This is the estimator
    $$\hat S_n^2=\frac{n}{n-1}\overline S_n^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-M_n)^2$$
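A short simulation (with arbitrary illustrative parameters and a deliberately small $n$ to make the effect visible) shows the $(n-1)/n$ bias factor of $\overline S_n^2$ and the unbiasedness of $\hat S_n^2$:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, v, n, trials = 1.0, 4.0, 5, 200_000    # illustrative values

x = rng.normal(theta, np.sqrt(v), size=(trials, n))
s2_biased   = x.var(axis=1, ddof=0)           # (1/n)     * sum (X_i - M_n)^2
s2_unbiased = x.var(axis=1, ddof=1)           # (1/(n-1)) * sum (X_i - M_n)^2

print(s2_biased.mean(),   (n - 1) / n * v)    # ~ ((n-1)/n) * v
print(s2_unbiased.mean(), v)                  # ~ v
```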

Confidence Intervals

  • Consider an estimator $\hat\Theta$ of an unknown parameter $\theta$. Besides the numerical value provided by an estimate, we are often interested in constructing a so-called confidence interval. Roughly speaking, this is an interval that contains $\theta$ with a certain high probability, for every possible value of $\theta$.
  • For a precise definition, let us first fix a desired confidence level, $1-\alpha$, where $\alpha$ is typically a small number. We then replace the point estimator $\hat\Theta_n$ by a lower estimator $\hat\Theta_n^-$ and an upper estimator $\hat\Theta_n^+$, designed so that $\hat\Theta_n^-\leq\hat\Theta_n^+$ and
    $$P_\theta\big(\hat\Theta_n^-\leq\theta\leq\hat\Theta_n^+\big)\geq1-\alpha$$
    for every possible value of $\theta$. Note that, similar to estimators, $\hat\Theta_n^-$ and $\hat\Theta_n^+$ are functions of the observations, and hence random variables whose distributions depend on $\theta$. We call $[\hat\Theta_n^-,\hat\Theta_n^+]$ a $\boldsymbol{1-\alpha}$ confidence interval.

Example 9.6.

  • Suppose that the observations $X_i$ are i.i.d. normal, with unknown mean $\theta$ and known variance $v$. Then the sample mean estimator
    $$\hat\Theta_n=\frac{X_1+\dots+X_n}{n}$$
    is normal, with mean $\theta$ and variance $v/n$.
  • Let $\alpha = 0.05$. Using the CDF $\Phi(z)$ of the standard normal (available in the normal tables), we have $\Phi(1.96) = 0.975 = 1-\alpha/2$, and we obtain
    $$P_\theta\bigg(\frac{|\hat\Theta_n-\theta|}{\sqrt{v/n}}\leq1.96\bigg)=1-\alpha=0.95$$
    We can rewrite this statement in the form
    $$P_\theta\bigg(\hat\Theta_n-1.96\sqrt{\frac{v}{n}}\leq\theta\leq\hat\Theta_n+1.96\sqrt{\frac{v}{n}}\bigg)=0.95,$$
    which implies that
    $$\bigg[\hat\Theta_n-1.96\sqrt{\frac{v}{n}},\ \hat\Theta_n+1.96\sqrt{\frac{v}{n}}\bigg]$$
    is a 95% confidence interval.
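A minimal sketch of this construction (illustrative values of $\theta$, $v$, and $n$; the variance $v$ is assumed known, as in the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
theta_true, v, n = 10.0, 4.0, 25               # illustrative; theta_true is unknown in practice
x = rng.normal(theta_true, np.sqrt(v), size=n)

theta_hat = x.mean()                           # sample mean estimator
alpha = 0.05
z = stats.norm.ppf(1 - alpha / 2)              # ~ 1.96, since Phi(1.96) ~ 0.975

print(theta_hat - z * np.sqrt(v / n), theta_hat + z * np.sqrt(v / n))   # 95% confidence interval
```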

Out of a variety of possible confidence intervals, one with the smallest possible width is usually desirable.


  • In the preceding example, we may be tempted to describe the concept of a 95% confidence interval by a statement such as “the true parameter lies in the confidence interval with probability 0.95.” Such statements, however, can be ambiguous. For example, suppose that after the observations are obtained, the confidence interval turns out to be $[-2.3, 4.1]$. We cannot then claim that $\theta$ lies in $[-2.3, 4.1]$ with probability 0.95, because the latter statement does not involve any random variables; after all, in the classical approach, $\theta$ is a constant.
  • For a concrete interpretation, suppose that $\theta$ is fixed. We construct a confidence interval many times, using the same statistical procedure, i.e., each time, we obtain an independent collection of $n$ observations and construct the corresponding 95% confidence interval. We then expect that about 95% of these confidence intervals will include $\theta$. This should be true regardless of the value of $\theta$.
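This repeated-sampling interpretation is easy to check by simulation. A sketch (fixed illustrative $\theta$ and known variance, as in Example 9.6) constructs many independent 95% intervals and reports the fraction containing $\theta$, which should come out close to 0.95:

```python
import numpy as np

rng = np.random.default_rng(7)
theta, v, n, trials = 3.0, 2.0, 20, 50_000     # illustrative values; v assumed known

x = rng.normal(theta, np.sqrt(v), size=(trials, n))
theta_hat = x.mean(axis=1)
half = 1.96 * np.sqrt(v / n)                   # half-width of each 95% interval

covered = (theta_hat - half <= theta) & (theta <= theta_hat + half)
print(covered.mean())                          # fraction of intervals that contain theta, ~ 0.95
```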

  • The construction of confidence intervals is sometimes hard. Fortunately, for many important models, $\hat\Theta_n-\theta$ is asymptotically normal and asymptotically unbiased. By this we mean that the CDF of the random variable
    $$\frac{\hat\Theta_n-\theta}{\sqrt{\mathrm{var}_\theta(\hat\Theta_n)}}$$
    approaches the standard normal CDF as $n$ increases, for every value of $\theta$. We may then proceed exactly as in Example 9.6, provided that $\mathrm{var}_\theta(\hat\Theta_n)$ is known or can be approximated.

Confidence Intervals Based on Estimator Variance Approximations

  • Suppose that the observations $X_i$ are i.i.d. with mean $\theta$ and variance $v$, both unknown. We may estimate $\theta$ with the sample mean
    $$\hat\Theta_n=\frac{X_1+\dots+X_n}{n}$$
    and estimate $v$ with the unbiased estimator
    $$\hat S_n^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-M_n)^2$$
  • In particular, we may estimate the variance $v/n$ of the sample mean by $\hat S_n^2/n$. Then, for a given $\alpha$, we may use these estimates and the central limit theorem to construct an (approximate) $1-\alpha$ confidence interval. This is the interval
    $$\bigg[\hat\Theta_n-z\frac{\hat S_n}{\sqrt n},\ \hat\Theta_n+z\frac{\hat S_n}{\sqrt n}\bigg],$$
    where $z$ is obtained from the relation
    $$\Phi(z)=1-\frac{\alpha}{2}$$
    and the normal tables.

  • Note that in this approach, there are two different approximations in effect. First, we are treating $\hat\Theta_n$ as if it were a normal random variable; second, we are replacing the true variance $v/n$ of $\hat\Theta_n$ by its estimate $\hat S_n^2/n$.
  • Even in the special case where the $X_i$ are normal random variables, the confidence interval produced by the preceding procedure is still approximate. The reason is that $\hat S_n^2$ is only an approximation to the true variance $v$, and the random variable
    $$T_n=\frac{\sqrt n(\hat\Theta_n-\theta)}{\hat S_n}$$
    is not normal. However, for normal $X_i$, it can be shown that the PDF of $T_n$ does not depend on $\theta$ and $v$, and can be computed explicitly. It is called the $t$-distribution with $n-1$ degrees of freedom. Like the standard normal PDF, it is symmetric and bell-shaped, but it is a little more spread out and has heavier tails. The probabilities of various intervals of interest are available in tables, similar to the normal tables.
  • Thus, when the $X_i$ are normal (or nearly normal) and $n$ is relatively small, a more accurate confidence interval is of the form
    $$\bigg[\hat\Theta_n-z\frac{\hat S_n}{\sqrt n},\ \hat\Theta_n+z\frac{\hat S_n}{\sqrt n}\bigg],$$
    where $z$ is now obtained from the relation
    $$\Psi_{n-1}(z)=1-\frac{\alpha}{2},$$
    and $\Psi_{n-1}$ denotes the CDF of the $t$-distribution with $n-1$ degrees of freedom.
  • On the other hand, when $n$ is moderately large (e.g., $n\geq50$), the $t$-distribution is very close to the normal distribution, and the normal tables can be used.
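Putting the pieces together, here is a sketch of the $t$-based interval for raw observations assumed to be i.i.d. and (nearly) normal, with both mean and variance unknown; the helper name and the sample data are illustrative, not from the text:

```python
import numpy as np
from scipy import stats

def t_confidence_interval(x, alpha=0.05):
    """Approximate 1 - alpha confidence interval for the mean of i.i.d. (near-)normal data."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta_hat = x.mean()
    s_hat = x.std(ddof=1)                      # square root of the unbiased variance estimate
    z = stats.t(df=n - 1).ppf(1 - alpha / 2)   # Psi_{n-1}(z) = 1 - alpha/2
    half = z * s_hat / np.sqrt(n)
    return theta_hat - half, theta_hat + half

# illustrative usage with made-up data
print(t_confidence_interval([1.2, 0.9, 1.1, 1.4, 1.0, 0.8, 1.3, 1.1]))
```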

Example 9.7.

  • The weight of an object is measured eight times using an electronic scale that reports the true weight plus a random error that is normally distributed with zero mean and unknown variance. Assume that the errors in the observations are independent. The following results are obtained:
    (The table of the eight measured values is not reproduced here.)
  • We compute a 95% confidence interval ($\alpha = 0.05$) using the $t$-distribution. The value of the sample mean $\hat\Theta_n$ is 0.5747, and $\hat S_n/\sqrt n$ is 0.0182. From the $t$-distribution tables, we obtain $1-\Psi_7(2.365) = 0.025 = \alpha/2$, so that
    $$\bigg[\hat\Theta_n-z\frac{\hat S_n}{\sqrt n},\ \hat\Theta_n+z\frac{\hat S_n}{\sqrt n}\bigg]=[0.531,\,0.618]$$
    is a 95% confidence interval.
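The interval can be reproduced from the summary statistics quoted above (the raw measurements themselves are not repeated here); small discrepancies are due to rounding:

```python
from scipy import stats

n = 8
theta_hat = 0.5747                       # sample mean reported in Example 9.7
se = 0.0182                              # reported value of S_hat_n / sqrt(n)

z = stats.t(df=n - 1).ppf(0.975)         # ~ 2.365
print(z)
print(theta_hat - z * se, theta_hat + z * se)   # roughly [0.531, 0.618], up to rounding
```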

  • The approximate confidence intervals constructed so far relied on the particular estimator $\hat S_n^2$ of the unknown variance $v$. However, different estimators or approximations of the variance are possible.
    • For example, suppose that the observations $X_1,\dots,X_n$ are i.i.d. Bernoulli with unknown mean $\theta$ and variance $v=\theta(1-\theta)$. Then, instead of $\hat S_n^2$, the variance could be approximated by $\hat\Theta(1-\hat\Theta)$. Another possibility is to just observe that $\theta(1-\theta)\leq1/4$ for all $\theta\in[0,1]$, and use $1/4$ as a conservative estimate of the variance.
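For the Bernoulli case, the two variance approximations mentioned above lead to intervals like the following sketch (illustrative data; the second interval uses the conservative bound $\theta(1-\theta)\leq 1/4$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.binomial(1, 0.3, size=400)        # i.i.d. Bernoulli observations (illustrative)

n = len(x)
theta_hat = x.mean()
alpha = 0.05
z = stats.norm.ppf(1 - alpha / 2)

# variance approximated by theta_hat * (1 - theta_hat)
half_plugin = z * np.sqrt(theta_hat * (1 - theta_hat) / n)
# conservative approximation: theta * (1 - theta) <= 1/4 for all theta in [0, 1]
half_conservative = z * np.sqrt(0.25 / n)

print(theta_hat - half_plugin,       theta_hat + half_plugin)
print(theta_hat - half_conservative, theta_hat + half_conservative)
```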