《统计计算》

本文介绍了统计计算的基础概念,包括概率论、抽样方法、随机数生成等关键内容,并探讨了如何利用MATLAB进行数据探索性分析及蒙特卡洛方法在统计推断中的应用。

Chapter 1 Introduction

What Is Computational Statistics?

An Overview of the Book

MATLAB Code

Further Reading


Chapter 2 Probability Concepts

Chapter 3 Sampling Concepts


Chapter 4 Generating Random Variables

p83

产生随机数

$$X = (b-a) \cdot U + a$$

Example 4.1

x = rand(1,1000);
[N,X]=hist(x,15);  % N 是频数, X 是分割点
bar(X,N,1,'w')  % 画出直方图
title('HISTOGRAM OF UNIFORM RANDOM VARIABLES')
xlabel('X')
ylabel('FREQUENCY')

seed方法:产生随机数种子

% Generate 3 random samples of size 5.
x = zeros(3,5);   % Allocate the memory.
for i=1:3
       rand('state',i) % set the state
       x(i,:)=rand(1,5);
end

% Recover previous state.
rand('state',2)
xt = rand(1,5);
x
xt

输出:

x =

0.9528    0.7041    0.9539    0.5982    0.8407
0.8752    0.3179    0.2732    0.6765    0.0712
0.5162    0.2252    0.1837    0.2163    0.4272

xt =

0.8752    0.3179    0.2732    0.6765    0.0712
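
rand('state',...) 属于旧版语法;较新的 MATLAB(假设 R2011a 及以上)一般用自带的 rng 函数管理随机数种子与状态。下面是一个等价写法的示意,仅供参考:

% 用 rng 设置/恢复随机数状态(示意)
x = zeros(3,5);
for i = 1:3
    rng(i);              % 以 i 为种子
    x(i,:) = rand(1,5);
end
rng(2);                  % 恢复种子 2 对应的状态
xt = rand(1,5);          % 应与 x(2,:) 相同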

逆方法

逆方法可以用来从连续分布中产生随机变量,其依据是:若 $X$ 的累积分布函数为 $F$,则 $U=F(X)$ 服从均匀分布 $U(0,1)$,于是
$$U = F(X), \qquad X = F^{-1}(U)$$
算法:

  1. Derive the expression for the inverse distribution function $F^{-1}(U)$.
    推导逆分布函数的表达式
  2. Generate a uniform random number $U$.
    从均匀分布中产生随机数 $U$
  3. Obtain the desired $X$ from $X=F^{-1}(U)$.
    从 $X=F^{-1}(U)$ 中得到想要的 $X$.

连续的情况:为什么从均匀分布里面产生一个随机数 $U$,然后将这个 $U$ 代入分布函数的反函数就能得到所需要的分布?
$\Downarrow$

$$\begin{aligned} &X\sim F(x), \qquad U\sim U(0,1) \\ &X=F^{-1}(U) \sim F(x)\\ \text{证明:}\quad F_{X}(x) &= P\{X\le x\} =P\{F^{-1}(U) \le x \} = P\{ U\le F(x) \} = F(x) \end{aligned}$$

离散的情况:为什么从均匀分布里面产生一个随机数 $U$,然后根据 $F(x_{i-1}) < U \le F(x_i)$ 取 $X=x_i$,就能得到所需要的分布?
$$\begin{aligned} X& \sim F(x)= \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ p_1 & p_2 & \cdots & p_n \end{pmatrix} \\ U& \sim U(0,1) \end{aligned}$$

取 $X=x_i \iff F(x_{i-1})<U\le F(x_i)$,于是

$$\begin{aligned} P(X=x_i) &= P\big(F(x_{i-1})<U\le F(x_i)\big) \\ &= F(x_i)-F(x_{i-1}) \\ &= p_i \end{aligned}$$

即 $X$ 的分布列与目标分布一致。

离散的情况:
$$P(X=x_i) = p_i; \quad x_0 < x_1 < x_2<\cdots, \qquad \sum_i p_i = 1$$
产生随机数 $U$,由 $U$ 得到随机数 $X$ 的公式如下:
$$X = x_i \quad \text{if} \quad F(x_{i-1}) < U \le F(x_i)$$

例子:

模拟一个离散的随机变量 $X$,其概率分布列为
$$P(X=0) = 0.3,\quad P(X=1) = 0.2,\quad P(X=2) = 0.5$$
累积分布函数为
$$F(x)=\begin{cases} 0, & x<0 \\ 0.3, & 0 \leq x<1 \\ 0.5, & 1 \leq x<2 \\ 1.0, & 2 \leq x \end{cases}$$
则根据式子,可以得到产生随机数的规则为:
$$X=\begin{cases} 0, & U \leq 0.3 \\ 1, & 0.3<U \leq 0.5 \\ 2, & 0.5<U \leq 1 \end{cases}$$

算法:

  1. Define a probability mass function for $x_i,\ i=1,\dots,k$. Note that $k$ could grow infinitely.
  2. Generate a uniform random number $U$.
  3. If $U \le p_0$, deliver $X=x_0$.
  4. Else if $U \le p_0 + p_1$, deliver $X=x_1$.
  5. Else if $U \le p_0 + p_1 + p_2$, deliver $X=x_2$.
  6. ... Else if $U \le p_0 + p_1 + \cdots + p_k$, deliver $X=x_k$.
n = 1000;
X=zeros(1,n);
% these are the x's in the domain
x = 0:2; 
% these are the probability masses
pr = [0.3 0.2 0.5];  
% generate 1000 random from the desired distribution
for i=1:n
   u=rand;  % generate the U
   if u<=pr(1)
      X(i)=x(1);
   elseif u<= sum(pr(1:2)) 
				% it has to be between 0.3 and 0.5
      X(i)=x(2);
  else 
      X(i)=x(3); % it has to be between 0.5 and 1
   end
end

% find the proportion of each number
x0=length(find(X==0))/n
x1=length(find(X==1))/n
x2=length(find(X==2))/n

x0 =

0.2800

x1 =

0.2200

x2 =

0.5000

(逆变换法产生随机数)

  1. 证明:设随机变量 $U$ 服从 $(0,1)$ 上的均匀分布,则 $X=F^{-1}(U)$ 的分布函数为 $F(x)$。
  2. 设 $X$ 的离散分布为:$P(X=-1)=0.2,\ P(X=1)=0.2,\ P(X=3)=0.6$,用逆变换法给出产生随机数的程序步骤。
  3. 用Matlab或R语言给出第二题的实现代码。
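
针对第 2、3 题,下面给出一个可能的 MATLAB 参考实现(仅为示意,取值 -1、1、3 及其概率来自题目):

n = 1000;
vals = [-1 1 3];
pr = [0.2 0.2 0.6];
cp = cumsum(pr);                      % 累积分布 F(x_i)
X = zeros(1,n);
for i = 1:n
    u = rand;                         % U ~ U(0,1)
    X(i) = vals(find(u <= cp, 1));    % 取第一个满足 U <= F(x_i) 的取值
end
% 检查各取值的频率,应接近 0.2, 0.2, 0.6
[sum(X==-1) sum(X==1) sum(X==3)]/n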

接受拒绝方法

In some cases, we might have a simple method for generating a random variable from one density, say $g(y)$, instead of the density we are seeking.
在某些情况下,我们可能有一种简单的方法从另一个密度(例如 $g(y)$)生成随机变量,而不是直接从我们想要的密度生成。

We can use this density to generate data from the desired continuous density $f(x)$.
我们可以使用该密度从所需的连续密度 $f(x)$ 生成数据。

We first generate a random number $\gamma$ from $g(y)$ and accept the value with a probability proportional to the ratio $\frac{f(\gamma)}{g(\gamma)}$.
我们首先从 $g(y)$ 生成一个随机数 $\gamma$,并以正比于 $\frac{f(\gamma)}{g(\gamma)}$ 的概率接受该值。

If we define $c$ as a constant that satisfies $\frac{f(y)}{g(y)} \le c$ for all $y$,
如果我们定义 $c$ 是一个常数,满足对于所有的 $y$,有 $\frac{f(y)}{g(y)} \le c$,

Then we can generate the desired variates using the procedure outlined next.
然后,我们可以使用下面概述的过程生成所需的变量。

The constant $c$ is needed because we might have to adjust the height of $g(y)$ to ensure that it is above $f(y)$.
需要常数 $c$,因为我们可能必须调整 $g(y)$ 的高度以确保其在 $f(y)$ 之上。

We generate points from $cg(y)$, and those points that are inside the curve $f(y)$ are accepted as belonging to the desired density.
我们从 $cg(y)$ 生成点,落在曲线 $f(y)$ 之内的那些点被接受,视为来自所需密度。

Those that are outside are rejected.
在外面的点是被拒绝的点。

It is best to keep the number of rejected variates small for maximum efficiency.
为了达到最高效率,最好使被拒绝的随机数数量尽可能少。

算法:(连续)

  1. Choose a density $g(y)$ that is easy to sample from.
  2. Find a constant $c$ such that $\frac{f(y)}{g(y)} \le c$ for all $y$.
  3. Generate a random number $\gamma$ from $g(y)$.
  4. Generate a uniform random number $U$.
  5. If $U \le \frac{f(\gamma)}{cg(\gamma)}$, then accept $X=\gamma$,
    else go to step 3.

例4.4 接受拒绝方法–beta分布

Generating random variables from the beta distribution with parameters $\alpha=2$ and $\beta=1$.
从参数为 $\alpha=2$、$\beta=1$ 的 beta 分布里面产生随机数

This yields the following probability density function: $f(x) = 2x,\ 0<x<1$.
这产生以下概率密度函数:$f(x) = 2x,\ 0<x<1$

Since the domain of this density is 0 to 1, we use the uniform distribution for our $g(y)$.
由于该密度的定义域为 0 到 1,因此我们将均匀分布用作 $g(y)$。

We must find a constant that we can use to inflate the uniform so it is above the desired beta density.
我们必须找到一个常数,我们可以用它来使均匀分布增大,使其高于所需的beta密度。

This constant is given by the maximum value of the density function, and from $f(x) = 2x,\ 0<x<1$, we see that $c=2$.
该常数取密度函数的最大值,由 $f(x)=2x,\ 0<x<1$ 可知 $c=2$。

The following MATLAB code generates 100 random variates from the desired distribution.
以下MATLAB代码根据所需分布生成100个随机变量。

We save both the accepted and the rejected variates for display purposes only.
我们同时保存被接受和被拒绝的随机数,仅用于作图展示。

c = 2;   % constant 
n=100;  % generate 100 rv's
% set up the arrays to store variates
x = zeros(1,n);  						% random variates
xy = zeros(1,n);						% corresponding y values
rej = zeros(1,n);						% rejected variates
rejy = zeros(1,n); % corresponding y values
irv=1;			
irej=1;
while irv <= n
   y = rand(1);  % random number from g(y)
   u = rand(1);  % random number for comparison
   if u <= 2*y/c
      x(irv)=y;
      xy(irv) = u*c;
      irv=irv+1;
   else
      rej(irej)= y;
      rejy(irej) = u*c; % really comparing u*c<=2*y
      irej = irej + 1;
   end
end

hold on
plot(x,xy,'o')
plot(rej,rejy,'*')

算法:(离散)

  1. Choose a probability mass function $q_i$ that is easy to sample from.
  2. Find a constant $c$ such that $p_{\gamma} \le c\, q_{\gamma}$ for all $\gamma$.
  3. Generate a random number $\gamma$ from the density $q_i$.
  4. Generate a uniform random number $U$.
  5. If $U \le \frac{p(\gamma)}{cq(\gamma)}$, then deliver $X=\gamma$,
    else go to step 3.

概率质量函数(probability mass function,简写为pmf)是离散随机变量在各特定取值上的概率。

例4.5 根据概率质量函数,用离散的接受拒绝方法去产生随机变量
$$P(X=1) = 0.15,\quad P(X=2) = 0.22,\quad P(X=3) = 0.33,\quad P(X=4) = 0.10,\quad P(X=5) = 0.20$$
取 $q_{y}$ 为离散均匀分布,取值为 $1,2,\dots,5$,则其概率质量函数为 $q_y = \frac{1}{5},\ y = 1,\dots,5$。

The value for $c$ is obtained as the maximum value of $p_y / q_y$, which is 1.65.
$c$ 取 $p_y / q_y$ 的最大值,即 1.65。

This quantity is obtained by taking the maximum $p_y$, which is $P(X=3) = 0.33$, and dividing by $1/5$:
通过取最大的 $p_y$(即 $P(X=3)=0.33$),再除以 $1/5$,可以得出此数值:
$$\frac{\max(p_y)}{1/5}=0.33 \times 5 = 1.65$$
则产生变量的伪代码为:

  1. 从离散的均匀分布中产生变量 $\gamma$。
    可以使用MATLAB中的函数 randi
  2. 生成均匀分布的随机数 $U$。
  3. 如果 $U \le \frac{p_{\gamma}}{cq_{\gamma}} = \frac{p_{\gamma}}{1.65 \cdot 1/5}=\frac{p_{\gamma}}{0.33}$,则令 $X=\gamma$,
    否则返回第1步

代码是一个练习,课后习题4.2

n = 500;   %产生500个随机数
X = zeros(1,n);   %定义X用来存放生成的随机数
x = 1:5;   %X的取值范围1,2,3,4,5
qy = 0.2;   %1/5
py = [0.15, 0.22, 0.33, 0.10, 0.20];   %各取值的概率
c = max(py)/qy;
i = 0;   %控制循环次数
while(i ~= n)
    Y = randi(5);   %随机产生1,2,3,4,5中的一个数
    u = rand;   %产生一个服从U(0,1)的随机数
    if u <= py(Y)/(c*qy)
        i = i+1;
        X(i) = Y;   %将满足条件的随机数录入X
    end
end

% 绘制X的直方图
% 后一个参数是范围,因为只有五个值,所以限定五个
[N,h]=hist(X,0:5);
bar(h,N/(n*(h(2)-h(1))),1,'w')
axis( [ 0 6 0 0.5 ] ) % 设定坐标轴范围为0<x<6,0<y<0.5
grid on % 画网格
phat = zeros(1,5);   %定义产生随机数的概率分布情况
for j = (1:5)
    phat(1,j) = length(find(X==j))/n;   %循环计算X=j时的概率
end
phat   %显示随机数的概率分布

​ (筛选法,Acceptance-Rejection Method)

  1. 若要从密度函数 $f(x)$ 中抽样,已知 $g(x)$ 是一个易于抽样的密度函数,给出筛选法抽样程序步骤。

    1. Choose a density $g(y)$ that is easy to sample from.
    2. Find a constant $c$ such that $\frac{f(y)}{g(y)} \le c$ for all $y$.
    3. Generate a random number $\gamma$ from $g(y)$.
    4. Generate a uniform random number $U$.
    5. If $U \le \frac{f(\gamma)}{cg(\gamma)}$, then accept $X=\gamma$,
      else go to step 3.
  2. 设 $f(x) = e^{-x}$,请选择 $g(x)$,并用筛选法给出产生于 $f(x)$ 的随机数代码。

    • 选择 $g(x)$ 为 $(0,50)$ 内的均匀分布,代码如下:
    clear
    clc
    c = 50;
    n=100; 
    x = zeros(1,n); 
    irv=1;			
    while irv <= n
       y = 50*rand(1);  % random number from g(y)
       u = rand(1);  % random number for comparison
       if u <= 50 * exp(-y)/c
          x(irv)=y;
          irv=irv+1;
       end
    end
    axis = 0:0.01:50;
    fx = exp(-axis);
    hold on
    [N,h] = hist(x);
    bar(h,N/(100*(h(2)-h(1))),1,'w')
    plot(axis,fx,'r-')
    hold off
    

产生连续的随机数

正态分布

rand - 生成均匀分布的伪随机数
randn- 生成标准正态分布的伪随机数

如果要生成任意的正态分布,用公式 $X=Z \cdot \sigma + \mu$ 来转化
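
一个最小的示意(假设 $\mu=3$、$\sigma=2$,数值仅作演示):

% 由标准正态 Z 生成 N(3, 2^2) 的随机数(示意)
mu = 3; sigma = 2;
Z = randn(1,1000);
X = Z*sigma + mu;        % X ~ N(mu, sigma^2)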

指数分布

逆方法

参数为 $\lambda$ 的指数分布的分布函数为 $F(x) = 1-e^{-\lambda x},\ 0<x<\infty$

推导:
$$u = F(x) = 1-e^{-\lambda x} \\ e^{-\lambda x} = 1-u \\ -\lambda x = \log (1-u) \\ x = - \frac{1}{\lambda} \log (1-u)$$
由于 $(1-u)$ 也是 $(0,1)$ 区间内的均匀随机数,所以上式可以写为 $x = - \frac{1}{\lambda} \log (u)$。

lam = 2;
n = 1000;
% Generate the random variables.
uni = rand(1,n);
X = -log(uni)/lam;

可以通过生成一系列的随机变量,画出其图像去验证是否符合指数分布。

可以通过画直方图,比较其与理论的概率密度函数

% Get the values to draw the theoretical curve.
x=0:.1:5;
% This is a function in the Statistics Toolbox.
y=exppdf(x,1/2);
% Get the information for the histogram.
[N,h]=hist(X,10);
% Change bar heights to make it correspond to
% the theoretical density - see Chapter 5.
N=N/(h(2)-h(1))/n;
% Do the plots.
bar(h,N,1,'w')
hold on
plot(x,y)
hold off
xlabel('X')
ylabel('f(x) - Exponential')

多维正态

z z z d × 1 d \times 1 d×1维的标准正态分布向量,   u \,u u d × 1 d \times 1 d×1维的标准正态分布向量的均值, R R R d × d d \times d d×d维的矩阵,满足 R T R = Σ R^T R = \Sigma RTR=Σ

用公式 x = R T z + μ x=R^T z + \mu x=RTz+μ 来转化

MATLAB 函数 csmvrnd产生多维标准正态随机变量 X = Z R + μ T X = ZR + \mu^T X=ZR+μT

X X X x × d x\times d x×d 维的矩阵, d d d维随机变量
Z Z Z n × d n\times d n×d维矩阵,是标准正态分布
X ∼ N p ( μ , Σ ) μ p × 1   Σ p × p Z ∼ N p ( 0 , I p ) ⇒ X = R T z + μ prove E ( X ) = E ( R T Z + μ ) = R T E ( Z ) + μ = μ \begin{aligned} &X \sim N_p(\mu,\Sigma) \quad \mu_{p \times 1}\ \Sigma_{p\times p} \\ & Z \sim N_p (0,I_p) \\ \Rightarrow & X = R^T z + \mu \\ \text{prove}& \\ & E(X) = E(R^TZ + \mu) = R^T E(Z) + \mu = \mu \end{aligned} proveXNp(μ,Σ)μp×1 Σp×pZNp(0,Ip)X=RTz+μE(X)=E(RTZ+μ)=RTE(Z)+μ=μ

addpath('C:\Users\Clancey\Documents\Nutstore\我的坚果云\研究生学习\课程笔记\3统计计算与优化\CSToolbox V3')
% Generate the multivariate random normal variables.
mu = [-2;3];
covm = [1 0.7 ; 0.7 1];
X = csmvrnd(mu,covm,500);
mean(X)
corrcoef(X)

ans =

-2.0633 2.9425

ans =

1.0000    0.6977
0.6977    1.0000
function X = csmvrnd(mu,covm,n)
% CSMVRND Generate multivariate normal random variables.
%
%   R = CSMVRND(MU,COVM,N) Generates a sample of size N
%   random variables from the multivariate normal  distribution. 
%   MU is the d-dimensional mean as a column vector. 
%   COVM is the d x d covariance matrix.

if det(covm) <=0
    % Then it is not a valid covariance matrix.
    error('The covariance matrix must be positive definite')
end
%  mu(:) 将行向量转化为列向量
mu = mu(:); % Just in case it is not a column vector.
d = length(mu);
% get cholesky factorization of covariance
% 获得协方差的cholesky分解
R = chol(covm);
% generate the standard normal random variables
% 生成标准正态随机变量
Z = randn(n,d);
X = Z*R + ones(n,1)*mu';
n = 500;
mu = [-2;3];
covm = [1 0.7 ; 0.7 1];
%  mu(:) 将行向量转化为列向量
% mu = mu(:); % Just in case it is not a column vector.
d = length(mu);
% get cholesky factorization of covariance
% 获得协方差的cholesky分解
% R = chol(A) 生成一个上三角矩阵 R 使得 R'*R = A.
% 如果 A 不是正定矩阵,则报错
R = chol(covm);
% generate the standard normal random variables
% 生成标准正态随机变量
Z = randn(n,d);
X = Z*R + ones(n,1)*mu';

Chapter 5 Exploratory Data Analysis

p117

直方图

A histogram is a way to graphically represent the frequency distribution of a data set.
直方图是一种以图形方式表示数据集频率分布的方式。

  • summarize a data set to understand general characteristics of the distribution such as shape, spread, or location,
    汇总数据集以了解分布的一般特征,例如形状,分布或位置,
  • suggest possible probabilistic models, or
    提示可能的概率模型
  • determine unusual behavior
    确定异常行为

   这里我把bin翻译成了区间

A frequency histogram is obtained by creating a set of bins or intervals that cover the range of the data set.
频率直方图通过创建一组覆盖数据集范围的区间或间隔来获得。

It is important that these bins do not overlap and that they have equal width.
重要的是这些区间不要重叠并且宽度相等。

We then count the number of observations that fall into each bin.
然后,我们计算落入每个区间中的观测值的数量。

To visualize this, we plot the frequency as the height of a bar, with the width of the bar representing the width of the bin.
为了可视化,我们将频率绘制为条形图的高度,条形图的宽度表示区间的宽度。

The histogram is determined by two parameters, the bin width and the starting point of the first bin.
直方图由两个参数确定,区间宽度和第一个区间的起点。

We discuss these issues in greater detail in Chapter 9.
我们将在第9章中更详细地讨论这些问题。

Relative frequency histograms are obtained by representing the height of the bin by the relative frequency of the observations that fall into the bin.
相对频率直方图通过用落入区间的观测值的相对频率表示区间的高度来获得。

The basic MATLAB package has a function for calculating and plotting a univariate histogram. This function is illustrated in the example given next.
基本的MATLAB软件包具有用于计算和绘制单变量直方图的功能。下一个示例说明了此功能。

频率直方图
相对频率直方图
密度直方图

例5.1 画出前臂(forearm)数据的直方图

The data consist of 140 measurements of the length in inches of the forearm of adult males.
该数据包括成年男性前臂长度(以英寸为单位)的140个测量值。

load forearm
subplot(1,2,1)
% The hist function optionally returns the 
% bin centers and frequencies.
[n,x]=hist(forearm);
% Plot and use the argument of width=1
% to produce bars that touch.
bar(x,n,1);
axis square
title('Frequency Histogram')
% Now create a relative frequency histogram.
% Divide each box by the total number of points.
subplot(1,2,2)
bar(x,n/140,1)
title('Relative Frequency Histogram')
axis square

These plots are shown in Figure 5.1
这些图如图5.1所示。

Notice that the shapes of the histograms are the same in both types of histograms, but the vertical axis is different.
请注意,两种类型的直方图的直方图形状均相同,但垂直轴不同。

From the shape of the histograms, it seems reasonable to assume that the data are normally distributed.
从直方图的形状来看,假设数据呈正态分布似乎是合理的。

One problem with using a frequency or relative frequency histogram is that they do not represent meaningful probability densities.
使用频率或相对频率直方图的一个问题是它们不代表有意义的概率密度。

The heights of the bars represent a step function that does not integrate to one.
条形的高度构成一个阶跃函数,其积分不等于一。

This can be seen by superimposing a corresponding normal distribution over the relative frequency histogram as shown in Figure 5.2.
如图5.2所示,可以通过在相对频率直方图上叠加相应的正态分布来看出这一点。


A density histogram is a histogram that has been normalized so it will integrate to one.
密度直方图是已归一化的直方图,因此它将积分为一。

This means that if we add up the areas represented by the bars, then they should add up to one.
这意味着,如果我们将条形图表示的面积相加,则它们应相加一。

A density histogram is given by the following equation:
密度直方图由以下公式给出:
$$\hat{f}(x) = \frac{v_k}{nh}, \quad x \text{ in } B_k$$
$B_k$ 代表第 $k$ 个区间;$v_k$ 代表落入第 $k$ 个区间的数据点数目;$h$ 为区间宽度,$n$ 为样本量。

we reproduce the histogram of figure 5.2 using the density histogram
我们使用密度直方图重现图5.2的直方图

load forearm
% Get parameter estimates for the normal distribution.
mu=mean(forearm);
v=var(forearm);
% Obtain normal pdf based on parameter estimates.
xp = linspace(min(forearm),max(forearm));
yp = csnormp(xp,mu,v);
% Get the information needed for a histogram.
[nu,x]=hist(forearm);
% Get the widths of the bins.
h = x(2)-x(1);
% Plot as density histogram - Equation 5.1.
bar(x,nu/(140*h),1)
hold on
plot(xp,yp)
title('Forearm Data Density Histogram and Estimate')
hold off

结果如图所示。注意,假设数据服从正态分布并非不合理:估计的密度函数和密度直方图非常吻合。

Quantile-Based Plots ---- Continuous Distributions

基于分位数的图----连续分布

If we need to compare two distributions, then we can use the quantile plot to visually compare them.
如果我们需要比较两个分布,则可以使用分位数图直观地比较它们。

This is also applicable when we want to compare a distribution and a sample or to compare two samples.
当我们要比较分布和样本或比较两个样本时,这也适用。

In comparing the distributions or samples, we are interested in knowing how they are shifted relative to each other.
在比较分布或样本时,我们有兴趣知道它们如何相对于彼此移动。

In essence, we want to know if they are distributed in the same way.
本质上,我们想知道它们是否以相同的方式分布。

This is important when we are trying to determine the distribution that generated our data, possibly with the goal of using that information to generate data for Monte Carlo simulation.
重要的是,当我们试图确定生成数据的分布时,我们可能想要使用该信息生成用于蒙特卡洛模拟的数据。

Another application where this is useful is in checking model assumptions, such as normality , before we conduct our analysis.
另一个应用是在进行分析之前检查模型假设(例如正态性)。

In this part, we discuss several versions of quantile-based plots.
在这一部分中,我们讨论基于分位数的图的几种版本。

These include quantile-quantile plots (q-q plots) and quantile plots (sometimes called a probability plot).
这些包括分位数-分位数图(q-q图)和分位数图(有时称为概率图)。

Quantile plots for discrete variables are discussed next.
接下来讨论离散变量的分位数图。

The quantile plot is used to compare a sample with a theoretical distribution.
分位数图用于将样本与某个理论分布进行比较。

Typically, a q-q plot (sometimes called an empirical quantile plot) is used to determine whether two random samples are generated by the same distribution.
通常,使用q-q图(有时称为经验分位数图)来确定是否通过相同的分布生成了两个随机样本。

It should be noted that the q-q plot can also be used to compare a random sample with a theoretical distribution by generating a sample from the theoretical distribution as the second sample.
应当注意,q-q 图也可以用来比较随机样本与理论分布:从理论分布中生成一个样本作为第二个样本即可。

Q-Q Plot

The q-q plot was originally proposed by Wilk and Gnanadesikan [1968] to visually compare two distributions by graphing the quantiles of one versus the quantiles of the other.
q-q图最初是由Wilk和Gnanadesikan [1968]提出的,通过绘制一个分布点的分位数与另一个分布点的图来直观地比较两个分布

Say we have two data sets consisting of univariate measurements.
假设我们有两个由单变量测量组成的数据集。

We denote the order statistics for the first data set by
我们将第一个数据集的次序统计量表示为
$$x_{(1)},x_{(2)},\dots,x_{(n)}$$
Let the order statistics for the second data set be
$$y_{(1)},y_{(2)},\dots,y_{(m)}$$
with $m\le n$.

We look first at the case where the sizes of the data sets are equal, so $m = n$.
我们首先来看两个数据集大小相等的情况,即 $m = n$。

In this case, we plot as points the sample quantiles of one data set versus the other data set.
在这种情况下,我们将一个数据集相对于另一数据集的样本分位数绘制为点。

This is illustrated in Example 5.4.
例5.4中对此进行了说明。

If the data sets come from the same distribution, then we would expect the points to approximately follow a straight line.
如果数据集来自相同的分布,则我们希望这些点大致沿着一条直线。

A major strength of the quantile-based plots is that they do not require the two samples (or the sample and theoretical distribution) to have the same location and scale parameter.
基于分位数的图的主要优势在于,它们不需要两个样本(或样本和理论分布)具有相同的位置和比例参数。

If the distributions are the same, but differ in location or scale, then we would still expect the quantile-based plot to produce a straight line.
如果分布相同,但是位置或比例不同,那么我们仍然希望基于分位数的图能够产生一条直线。

例5.4

We will generate two sets of normal random variables and construct a q-q plot.
我们将生成两组正态随机变量并构建一个q-q图。

The q-q plot (Figure 5.6) follows a straight line (approximately), indicating that the samples come from the same distribution.
q-q图(图5.6)遵循一条直线(近似),表明样本来自相同的分布。

randn('state',1) % 设置seed
x = randn(1,75);
y = randn(1,75);
% Find the order statistics.
xs = sort(x);
ys = sort(y);
% Now construct the q-q plot.
plot(xs,ys,'o')
xlabel('X - Standard Normal')
ylabel('Y - Standard Normal')
axis equal  % 取消系统自动拉长x轴或y轴

If we repeat the above MATLAB commands using a data set generated from an exponential distribution and one that is generated from the standard normal, then we have the plot shown in Figure 5.7. Note that the points in this q-q plot do not follow a straight line, leading us to conclude that the data are not generated from the same distribution.

如果我们使用从指数分布生成的数据集和从标准正态生成的数据集重复上述MATLAB命令,那么图5.7所示。
请注意,此q-q图中的点不是直线,这使我们得出结论,数据不是从同一分布生成的。

We now look at the case where the sample sizes are not equal.
现在我们来看样本大小不相等的情况。

Without loss of generality, we assume that $m<n$.
在不失一般性的前提下,我们假定 $m<n$。

To obtain the q-q plot, we graph the $y_{(i)},\ i=1,2,\dots,m$ against the $(i-0.5)/m$ quantiles of the other data set.
为了获得 q-q 图,我们将 $y_{(i)},\ i=1,2,\dots,m$ 与另一个数据集的 $(i-0.5)/m$ 分位数作图。

Note that this definition is not unique, as values other than $0.5$ can be used [Cleveland, 1993].
注意,这个定义不是唯一的,因为可以使用 $0.5$ 以外的值 [Cleveland, 1993]。

The $(i-0.5)/m$ quantiles of the $x$ data are usually obtained via interpolation, and we show in the next example how to use the function csquantiles to get the desired plot.
$x$ 数据的 $(i-0.5)/m$ 分位数通常通过插值获得,我们在下一个示例中展示如何使用函数 csquantiles 获得所需的图。

Users should be aware that q-q plots provide a rough idea of how similar the distribution is between two random samples.
应注意,q-q图可以粗略了解两个随机样本之间的分布有多相似。

If the sample sizes are small, then a lot of variation is expected, so comparisons might be suspect.
如果样本量很小,那么预计会有很大的差异,因此可能会产生比较怀疑。

To help aid the visual comparison, some q-q plots include a reference line.
为了帮助进行视觉比较,一些q-q图包括参考线。

These are lines that are estimated using the first and third quartiles ($q_{0.25}, q_{0.75}$) of each data set.
这些参考线是用每个数据集的第一和第三四分位数($q_{0.25}, q_{0.75}$)估计得到的。

The line is added to the graphic and plotted to cover the range of the data.
将该线添加到图形并进行绘制以覆盖数据范围。

The MATLAB Statistics Toolbox provides a function called qqplot that displays this type of plot.
MATLAB Statistics Toolbox提供了一个名为qqplot的函数,用于显示这种类型的图。

We show next how to add the line.
接下来,我们演示如何添加行。

例5.5

This example shows how to do a q-q plot when the samples do not have the same number of points.
此示例显示了当样本的点数不同时如何绘制q-q图。

We use the function csquantiles to get the required sample quantiles from the data set that has the larger sample size.
我们使用csquantiles函数从具有较大样本量的数据集中获取所需的样本分位数。

We then plot these versus the order statistics of the other sample, as we did in the previous examples.
然后,像在前面的示例中所做的那样,将这些数据与另一个样本的顺序统计信息进行比较。

Note that we add a reference line based on the first and third quartiles of each data set, using the function polyfit (see Chapter 8 for more information on this function).
注意,我们使用 polyfit 函数,根据每个数据集的第一和第三四分位数添加参考线(有关此函数的更多信息,请参见第8章)。

randn('state',1)
m = 50;
x = randn(1,75);
y = randn(1,m);
% find the order statistics for y
ys = sort(y);
% Now find the associated quantiles using the x
p = ((1:m) - 0.5)/m; % probabilities for quantiles
xs = csquantiles(x,p);
% Construct the plot
plot(xs,ys,'ko')
% Get the reference line.
% Use the 1st and 3rd quartiles of each set to
% get a line.
qy = csquantiles(y,[0.25,0.75]);
qx = csquantiles(x,[0.25,0.75]);
[pol, s] = polyfit(qx,qy,1);
% Add the line to the figure.
yhat = polyval(pol,xs);
hold on
plot(xs,yhat,'k')
xlabel('Sample Quantiles - X'),
ylabel('Sorted Y Values')
hold off

上面那个函数是书里面自带的,作者写的程序,考试应该要用源代码,允许使用插值函数interp1

m = 50;
x = randn(1,75);
y = randn(1,m);
% find the order statistics for y
ys = sort(y);

p = ((1:m) - 0.5)/m;
%xs = csquantiles(x,p);
xs=sort(x);
qhat = zeros(size(p));
n=length(x);
phat = ((1:n)-0.5)/n;
qhat=interp1(phat,xs,p); % 插值函数
plot(qhat,ys,'ko')

Quantile Plots 分位数图

A quantile plot or probability plot is one where the theoretical quantiles are plotted against the order statistics for the sample.
分位数图或概率图是相对于样本的顺序统计量绘制理论分位数的图.

Thus, on one axis we plot the $x_{(i)}$ and on the other axis we plot $F^{-1}\!\left( \frac{i-0.5}{n} \right)$, where $F^{-1}(\cdot)$ denotes the inverse of the cumulative distribution function for the hypothesized distribution.
因此,在一个轴上我们绘制 $x_{(i)}$,在另一轴上我们绘制 $F^{-1}\!\left( \frac{i-0.5}{n} \right)$,其中 $F^{-1}(\cdot)$ 表示假设分布的累积分布函数的反函数。

As before, the $0.5$ in the above argument can be different [Cleveland, 1993].
和以前一样,上式中的 $0.5$ 可以取其他值 [Cleveland, 1993]。

A well-known example of a quantile plot is the normal probability plot, where the ordered sample is plotted versus the quantiles of the normal distribution.
分位数图的一个著名示例是“正态概率图”,其中绘制了有序样本与正态分布的分位数之间的关系。

The MATLAB Statistics Toolbox has several functions for obtaining quantile plots.
MATLAB Statistics Toolbox具有获取分位数图的几种功能。

One is called normplot, and it produces a normal probability plot.
一个叫做normplot,它产生一个正态概率图。

So, if one would like to assess the assumption that a data set comes from a normal distribution, then this is the one to use.
因此,如果您想评估数据集来自正态分布的假设,那么就可以使用它。

Another function is called probplot. Using this function, one can construct probability plots for several distributions, such as the exponential, lognormal, normal, and others.
另一个函数称为**probplot **。使用此函数,可以构造几种分布的概率图,例如指数分布,对数正态分布,正态分布和其他分布。

There is also a function for constructing a quantile plot that compares a data set to the Weibull distribution. This is called wblplot.
还有一个函数用于构建将数据集与 Weibull 分布进行比较的分位数图,称为 wblplot。

For quantile plots with other theoretical distributions, one can use the MATLAB code given next, substituting the appropriate function to get the theoretical quantiles.
对于具有其他理论分布的分位数图,可以使用下面给出的MATLAB代码,代之以适当的函数以获得理论分位数。

例5.6

This example illustrates how you can display a quantile plot in MATLAB. We first generate a random sample from the standard normal distribution as our data set. The sorted sample is an estimate of the $(i-0.5)/n$ quantile, so we next calculate these probabilities and get the corresponding theoretical quantiles. Finally, we use the function norminv from the Statistics Toolbox to get the theoretical quantiles for the normal distribution. The resulting quantile plot is shown in Figure 5.9.
本示例说明了如何在MATLAB中显示分位数图。
我们首先从标准正态分布中生成一个随机样本作为我们的数据集。
排序后的样本是 $(i-0.5)/n$ 分位数的估计,因此我们接下来计算这些概率并获得相应的理论分位数。
最后,我们使用统计工具箱中的norminv函数来获得正态分布的理论分位数。
产生的分位数图如图5.9所示。

x = randn(1,100);
% Get the probabilities.
prob = ((1:100)-0.5)/100;
% Now get the theoretical quantiles.
qp = norminv(prob,0,1);  % 反求分位数
% Now plot theoretical quantiles versus 
% the sorted data.
plot(sort(x),qp,'ko') % x需要排序,y已经排序好了
xlabel('Sorted Data')
ylabel('Standard Normal Quantiles')

To further illustrate these concepts, let’s see what happens when we generate a random sample from a uniform (0, 1) distribution and check it against the normal distribution. The MATLAB code is given next, and the quantile plot is shown in Figure 5.10. As expected, the points do not lie on a line, and we see that the data are not from a normal distribution.

为了进一步说明这些概念,让我们看看从均匀(0,1)分布生成随机样本并将其与正态分布进行对照时会发生什么。
接下来给出MATLAB代码,分位数图如图5.10所示。
不出所料,这些点不在一条线上,并且我们看到数据不是来自正态分布。

% Generate a random sample from a 
% uniform distribution.
x = rand(1,100);
% Get the probabilities.
prob = ((1:100)-0.5)/100;
% Now get the theoretical quantiles.
qp = norminv(prob,0,1);
% Now plot theoretical quantiles versus 
% the sorted data.
plot(sort(x),qp,'ko')
ylabel('Standard Normal Quantiles')
xlabel('Sorted Data')

Quantile Plots ---- Discrete Distributions 分位数图----离散分布

Previously, we described quantile plots that are primarily used for continuous data. We would like to have a similar technique for graphically comparing the shapes of discrete distributions. Hoaglin and Tukey [1985] developed several plots to accomplish this. We present two of them here: the Poissonness plot and the binomialness plot. These will enable us to search for evidence that our discrete data follow a Poisson or a binomial distribution. They also serve to highlight which points might be incompatible with the model.

之前,我们描述了主要用于连续数据的分位数图。
我们希望有一种类似的技术,以图形方式比较离散分布的形状。
Hoaglin和Tukey [1985]开发了几个图来实现这一目标。
我们在这里介绍其中两个:泊松图和二项式图。
这些将使我们能够搜索证据,证明我们的离散数据遵循泊松分布或二项分布。
它们还有助于突出显示哪些点可能与模型不兼容。

Poissonness Plot

泊松图,查看一个样本是否是泊松分布

Typically, discrete data are whole number values that are often obtained by counting the number of times something occurs.
通常,离散数据是整数值,通常可以通过计算发生某事的次数来获得。

For example, these might be the number of traffic fatalities, the number of school-age children in a household, the number of defects on a hard drive, or the number of errors in a computer program.
例如,这些可能是交通事故死亡人数,家庭中学龄儿童的人数,硬盘驱动器上的缺陷数或计算机程序中的错误数。

We sometimes have the data in the form of a frequency distribution that lists the possible count values (e.g., 0,1,2,…) and the number of observations that are equal to the count values.
有时我们会以频率分布的形式获取数据,其中列出了可能的计数值(例如0、1、2,…)以及与计数值相等的观测值数量。

The counts will be denoted as k, with k =0,1,…L.
We will assume that L is the maximum observed value for our discrete variable or counts in the data set and that we are interested in all counts between 0 and L.
计数将表示为k,其中k = 0,1,… L。
我们将假定L是离散变量或数据集中计数的最大观测值,并且我们对0到L之间的所有计数都感兴趣。

Thus, the total number of observations in the sample is $N = \sum_{k=0}^L n_k$.
因此,样本中的观测总数为 $N = \sum_{k=0}^L n_k$。

where $n_k$ represents the number of observations that are equal to the count $k$.
其中 $n_k$ 代表计数值等于 $k$ 的观测数目。

A basic Poissonness plot is constructed by plotting the count $k$ on the horizontal axis and $\varphi(n_k)$ on the vertical axis, where
$$\varphi(n_k) = \ln \left(\frac{k! \cdot n_k}{N}\right)$$
推导:用频率近似概率,
$$\frac{n_k}{N} \approx P(X=k)=\frac{\lambda^k}{k!}e^{-\lambda}$$

$$k! \cdot \frac{n_k}{N} \approx \lambda^k e^{-\lambda}$$

$$\varphi(n_k)= \ln \left(\frac{k! \cdot n_k}{N}\right) \approx k \cdot \ln \lambda - \lambda = a \cdot k + b$$

上述公式的解释:用频率 $\frac{n_k}{N}$ 来近似概率,然后把概率质量函数中的 $k!$ 移到左边,再取对数,就把关于 $n_k$ 的式子 $\varphi(n_k)$ 转化为关于 $k$ 的线性函数;如果对于每一个 $k$,数据都来自同一个泊松分布,则画出来的点近似落在一条直线上,直线的斜率为 $\ln \lambda$,截距为 $-\lambda$。

$\Uparrow$

为什么 Poissonness Plot 是直线的解释。

例 5.7
$$\begin{array}{cc} \hline \text{Number of Occurrences of the Word may } (k) & \text{Number of Blocks } (n_k) \\ \hline 0 & 156 \\ 1 & 63 \\ 2 & 29 \\ 3 & 8 \\ 4 & 4 \\ 5 & 1 \\ 6 & 1 \\ \hline \end{array}$$

k=0:6;  % vector of counts
n_k = [156 63 29 8 4 1 1];
N=sum(n_k);
% get vector of factorials
fact=zeros(size(k));
for i=k
   fact(i+1)=factorial(i);  % 阶乘
end
% get phi(n_k) for plotting
phik=log(fact.*n_k/N);   % 公式
% find the counts that are equal to 1
% plot these with the symbol 1
% plot rest with a symbol
ind=find(n_k~=1);  % 将极端情况n_k=1找出来,极端情况容易使得样本异常
plot(k(ind),phik(ind),'o')
ind=find(n_k==1);
if ~isempty(ind)
   text(k(ind),phik(ind),'1') % 将 n_k=1 的点用符号 1 标出
end
% add some whitespace to see better
axis([-0.5 max(k)+1 min(phik)-1 max(phik)+1])
xlabel('Number of Occurrences - k')
ylabel('\phi (n_k)')

泊松分布的纠偏
$$n_{k}^{*}=\begin{cases} n_{k}-0.67-0.8\, n_{k} / N, & n_{k} \geq 2 \\ 1 / e, & n_{k}=1 \\ \text{undefined}, & n_{k}=0 \end{cases} \tag{5.3}$$

Binomialness Plot

二项分布图
$$\varphi(n_k^*) = \ln \left( \frac{n_k^*}{N \times C_n^k} \right)$$
$$X\sim B(n,p)$$

$$P(X=k) = C_n^k\, p^k (1-p)^{n-k}$$

$$\frac{n_k}{N} \approx C_n^k\, p^k (1-p)^{n-k}$$

$$\frac{n_k}{N \times C_n^k} \approx p^k (1-p)^{n-k}$$

$$\begin{aligned}\varphi(n_k) &= \ln \left( \frac{n_k}{N \times C_n^k} \right)\\ &=k \ln p + (n-k) \ln(1-p)\\&=\big(\ln p - \ln(1-p)\big)\,k + n\ln(1-p)\\&=a\cdot k + b\end{aligned}$$

对上述公式的解释:如果随机变量 $X$ 服从二项分布,则用频率 $\frac{n_k}{N}$ 近似概率,把概率中的组合数 $C_n^k$ 移到频率一边,再对两边取对数,则 $\varphi(n_k)$ 就可以写成关于 $k$ 的线性函数;如果对每一个 $k$,数据都满足同一个二项分布,则以 $k$ 为横坐标、$\varphi(n_k)$ 为纵坐标画出的点近似落在一条直线上。

$\Uparrow$

为什么 Binomialness Plot 是一条直线

例 5.9
$$\begin{array}{cc} \hline \text{Number of Females } (k) & \text{Number of Blocks } (n_k) \\ \hline 0 & 1 \\ 1 & 2 \\ 2 & 4 \\ 3 & 23 \\ 4 & 25 \\ 5 & 19 \\ 6 & 18 \\ 7 & 5 \\ 8 & 1 \\ 9 & 1 \\ 10 & 0 \\ \hline \end{array}$$

k=0:9;
n=10;
n_k=[1 3 4 23 25 19 18 5 1 1];
N=sum(n_k);
nCk=zeros(size(k));
for i=k
   nCk(i+1)=nchoosek(n,i); % 从n里面找k个C_n^k
end
phat=n_k/N;
nkstar=n_k-0.67-0.8*phat;  % 纠偏
% Find the frequencies that are 1; nkstar=1/e.
ind1=find(n_k==1);
nkstar(ind1)= 1/2.718;  % 纠偏
% Get phi(n_k) for plotting.
phik=log(nkstar./(N*nCk));  % 计算phi(k)
% Find the counts that are equal to 1.
ind=find(n_k~=1);
plot(k(ind),phik(ind),'o') % 把不等于1 的点画上圆圈
if ~isempty(ind1)
   text(k(ind1),phik(ind1),'1') % 把等于1的点标上标签1
end
% Add some whitespace to see better.
axis([-0.5 max(k)+1 min(phik)-1 max(phik)+1]) % 规定坐标轴范围
xlabel('Number of Females - k')
ylabel('\phi (n^*_k)')
Bivariate Histogram

二元直方图
$$\hat{f}(x) = \frac{v_k}{n h_1 h_2}, \quad x \text{ in } B_k$$
$v_k$ 代表观测数据落入二元区间 $B_k$ 内的数目

$h_i$ 是第 $i$ 个方向上区间的宽度
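
一个可能的示意:用 histcounts2(假设 MATLAB R2015b 及以上可用)按上式估计二元密度直方图:

% 二元密度直方图示意(数据为随意生成的示例)
x = randn(1,1000);  y = randn(1,1000);
[nu,xe,ye] = histcounts2(x,y,[10 10]);        % nu 即各二元区间内的频数 v_k
h1 = xe(2)-xe(1);  h2 = ye(2)-ye(1);          % 区间宽度 h_1, h_2
fhat = nu/(1000*h1*h2);                        % 按公式归一化,使其积分为 1
xc = xe(1:end-1)+h1/2;  yc = ye(1:end-1)+h2/2; % 区间中心
surf(xc, yc, fhat')                            % 粗略可视化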

Scatterplots

Andrews Curves

Andrews 曲线的定义为:
$$f_x(t) = x_1 / \sqrt{2} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \cdots \tag{5.9}$$
利用傅里叶展开,然后用线与线之间的形状来判定

  • 相同的类其线的形状相似
  • 不同的类其线的形状不同

Andrews curves [Andrews, 1972] were developed as a method for visualizing multi-dimensional data by mapping each observation onto a function.
安德鲁斯曲线[Andrews,1972]为通过将每个观测值映射到函数上来可视化多维数据的方法。

This is similar to star plots or Chernoff faces in that each observation or sample point is represented by a glyph, except that in this case the glyph is a curve.
这类似于星形图或Chernoff面,因为每个观察点或采样点都由一个字形表示,但在这种情况下,字形是曲线。

The function for Andrews curves is defined as
Andrews曲线的函数定义为
$$f_x(t) = x_1 / \sqrt{2} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \cdots \tag{5.9}$$
where the range of $t$ is given by $-\pi\le t \le \pi$.

范围是 $-\pi\le t \le \pi$。

Each observation is projected onto a set of orthogonal basis functions represented by sines and cosines and then plotted.
每个观测值都投影到由正弦和余弦表示的一组正交基函数上,然后进行绘制。

Thus, each sample point is now represented by a curve given by Equation 5.9.
因此,现在每个采样点都由公式5.9给出的曲线表示。

We illustrate how to get the Andrews curves in the following example.
在以下示例中,我们说明如何获取安德鲁斯曲线。

例 5.23
$$\begin{aligned} x_1 = (2,6,4) &\quad f_{x_1}(t)= 2/ \sqrt{2} + 6 \sin t + 4 \cos t \\ x_2 = (5,7,3) &\quad f_{x_2}(t)= 5/ \sqrt{2} + 7 \sin t + 3 \cos t \\ x_3 = (1,8,9) &\quad f_{x_3}(t)= 1/ \sqrt{2} + 8 \sin t + 9 \cos t \end{aligned}$$

% Get the domain.
t = linspace(-pi,pi);
% Evaluate function values for each observation.
f1 = 2/sqrt(2)+6*sin(t)+4*cos(t);
f2 = 5/sqrt(2)+7*sin(t)+3*cos(t);
f3 = 1/sqrt(2)+8*sin(t)+9*cos(t);
plot(t,f1,'.',t,f2,'*',t,f3,'o')
legend('F1','F2','F3')
xlabel('t')

例 5.24

load iris
% 定义将要绘制的域。
theta=(-pi+eps):0.1:(pi-eps);
% 每一个特征的数目
n = length(setosa);
% 每一个特征的维数
p = 4;
ysetosa = zeros(n,p);       
% 绘制n条曲线,每个数据点一条。
yvirginica=zeros(n,p);
% 取每行的点积进行观察。
ang = zeros(length(theta),p);   
fstr = '[1/sqrt(2) sin(i) cos(i) sin(2*i)]';
k = 0;
% 在每个角度theta上计算sin和cos函数。
for i=theta
   k=k+1;
   ang(k,:)=eval(fstr);
end
% 现在为每个观察值生成一个“ y”。
for i=1:n 
    % 一共有50条线
    % 即每一个分类有50条线
  for j=1:length(theta)
     % 通过观察找到点积。
     % 第i条线的第j个值
     % 遍历j,得到第i条线
     ysetosa(i,j)=setosa(i,:)*ang(j,:)';    
     yvirginica(i,j)=virginica(i,:)*ang(j,:)';
  end
end
hold
for i=1:n
  plot(theta,ysetosa(i,:),'r',...
				theta,yvirginica(i,:),'b-.')
end
legend('Iris Setosa','Iris Virginica')
hold off
title('Andrews Plot')
xlabel('t')
ylabel('Andrews Curve')

Chapter 6 Finding Structure

Chapter 7 Monte Carlo Methods for Inferential Statistics

p229

Methods in inferential statistics are used to draw conclusions about a population and to measure the reliability of these conclusions using information obtained from a random sample.
推论统计中的方法用于得出有关总体的结论,并使用从随机样本中获得的信息来衡量这些结论的可靠性。

Inferential statistics involves techniques such as estimating population parameters using point estimates,calculating confidence interval estimates for parameters, hypothesis testing,and modeling (e.g., regression and density estimation).
推论统计涉及一些技术,例如使用点估计来估计总体参数,计算参数的置信区间估计,假设检验和建模(例如回归和密度估计)。

To measure the reliability of the inferences that are made, the statistician must understand the distribution of any statistics that are used in the analysis.
为了衡量所做推断的可靠性,统计学家必须了解分析中所用各统计量的分布。

In situations where we use a well-understood statistic, such as the sample mean, this is easily done analytically.
在我们使用易于理解的统计数据(例如样本均值)的情况下,这很容易通过分析来完成。

However, in many applications, we do not want to be limited to using such simple statistics or to making simplifying assumptions.
但是,在许多应用中,我们不希望局限于使用这种简单的统计信息或进行简化的假设。

The goal of this chapter is to explain how simulation or Monte Carlo methods can be used to make inferences when the traditional or analytical statistical methods fail.
本章的目的是说明当传统或分析统计方法失败时,如何使用模拟或蒙特卡洛方法进行推断。


According to Murdoch [2000], the term Monte Carlo originally referred to simulations that involved random walks and was first used by John von Neumann and S. M. Ulam in the 1940s.
根据 Murdoch [2000] 的说法,蒙特卡洛一词最初是指涉及随机游走的模拟,由约翰·冯·诺伊曼和 S. M. 乌拉姆在 1940 年代首次使用。

Today, the Monte Carlo method refers to any simulation that involves the use of random numbers.
如今,蒙特卡洛方法泛指任何使用随机数的模拟。

In the following sections, we show that Monte Carlo simulations (or experiments)are an easy and inexpensive way to understand the phenomena of interest[Gentle, 1998].
在以下各节中,我们将说明蒙特卡洛模拟(或实验)是一种简单且成本低廉的方法,可用来理解感兴趣的现象 [Gentle, 1998]。

To conduct a simulation experiment, you need a model that represents your population or phenomena of interest and a way to generate random numbers (according to your model) using a computer.
要进行模拟实验,您需要一个代表您的总体或感兴趣现象的模型,以及一种使用计算机生成随机数(根据模型)的方法。

The data that are generated from your model can then be studied as if they were observations.
然后,可以像从观测值一样研究从模型生成的数据。

As we will see, one can use statistics based on the simulated data(means, medians, modes,variance, skewness, etc.) to gain understanding about the population.
正如我们将看到的,可以使用基于模拟数据(均值,中位数,众数,方差,偏度等)的统计信息来了解总体。


In Section 7.2, we give a short overview of methods used in classical inferential statistics, covering such topics as hypothesis testing, power, and confidence intervals.
在第7.2节中,我们简要概述了经典推论统计中使用的方法,涵盖了假设检验,功效和置信区间等主题。

The reader who is familiar with these may skip this section.
熟悉这些内容的读者可以跳过本节。

In Section 7.3, we discuss Monte Carlo simulation methods for hypothesis testing and for evaluating the performance of the tests.
在第7.3节中,我们讨论了用于假设检验和评估检验性能的蒙特卡洛模拟方法。

The bootstrap method for estimating the bias and variance of estimates is presented in Section 7.4.
7.4节介绍了用于估计估计值的偏差和方差的bootstrap方法。

Finally, Sections 7.5 and 7.6 conclude the chapter with information about available MATLAB code and references on Monte Carlo simulation and the bootstrap.
最后,第7.5和7.6节在本章结尾介绍了可用的 MATLAB 代码,以及有关蒙特卡洛模拟和 bootstrap 的参考文献。

Classical Inferential Statistics

这部分的可以看统计学的相关书籍,不赘述

Hypothesis Testing

  • A transportation official in the Washington, D.C. area thinks that the mean travel time to work for northern Virginia residents has increased from the average time it took in 1995.
    华盛顿特区的一名运输官员认为,弗吉尼亚北部居民上班的平均旅行时间比1995年的平均时间有所增加。
  • A medical researcher would like to determine whether aspirin decreases the risk of heart attacks.
    医学研究人员想确定阿司匹林是否可以降低心脏病发作的风险。
  • A pharmaceutical company needs to decide whether a new vaccine is superior to the one currently in use.
    一家制药公司需要确定一种新疫苗是否优于当前使用的疫苗。
  • An engineer has to determine whether there is a difference in accuracy between two types of instruments.
    工程师必须确定两种仪器之间的准确度是否存在差异。

Steps of hypothesis testing

  1. Determine the null and alternative hypotheses, using mathematical expressions if applicable.
    确定原假设和备择假设,在适用时使用数学表达式。
    Usually, this is an expression that involves a characteristic or descriptive measure of a population.
    通常,这是一个涉及总体某个特征或描述性度量的表达式。
  2. Take a random sample from the population of interest.
    从目标总体中随机抽取一个样本。
  3. Calculate a statistic from the sample that provides information about the null hypothesis.
    从样本计算统计信息,以提供有关原假设的信息。
    We use this to make our decision.
    我们以此来做出决定。
  4. If the value of the statistic is consistent with the null hypothesis, then do not reject $H_0$.
    如果统计量的值与原假设一致,则不拒绝 $H_0$。
  5. If the value of the statistic is not consistent with the null hypothesis, then reject $H_0$ and accept the alternative hypothesis.
    如果统计量的值与原假设不一致,则拒绝 $H_0$ 并接受备择假设。

Type Ⅰ error and Type Ⅱ error

Error in Statistical Hypothesis Testing
$$\begin{array}{llc} \hline \text{Type of Error} & \text{Description} & \text{Probability of Error} \\ \hline \text{Type I Error} & \text{Rejecting } H_{0} \text{ when it is true} & \alpha \\ \text{Type II Error} & \text{Not rejecting } H_0 \text{ when it is false} & \beta \\ \hline \end{array}$$

Procedure - Hypothesis testing (critical value approach)
程序-假设检验(临界值方法)

  1. Determine the null and alternative hypotheses.
    确定原假设和替代假设。

  2. Find a test statistic $T$ that will provide evidence that $H_0$ should be accepted or rejected (e.g., a large value of the test statistic indicates $H_0$ should be rejected).
    找到一个检验统计量 $T$,为接受或拒绝 $H_0$ 提供证据(例如,检验统计量的值较大表示应拒绝 $H_0$)。

  3. Obtain a random sample from the population of interest and compute the observed value of the test statistic $t_0$ using the sample.
    从目标总体中获取随机样本,并使用样本计算检验统计量 $t_0$ 的观测值。

  4. Using the sampling distribution of the test statistic under the null hypothesis and the significance level, find the critical value(s).
    使用原假设下检验统计量的抽样分布和显著性水平,找到临界值。
    That is, find the $t$ such that
    也就是说,找到满足下式的 $t$:
    Upper Tail Test: $P_{H_0}(T\le t) = 1-\alpha$
    Lower Tail Test: $P_{H_0}(T\le t) = \alpha$

    Two-Tail Test: $\left\{ \begin{array}{l} P_{H_0}(T\le t_1) = \alpha / 2 \\ P_{H_0}(T\le t_2) = 1 - \alpha / 2 \end{array}\right.$
    where $P_{H_0}(\cdot)$ denotes the probability under the null hypothesis.
    其中 $P_{H_0}(\cdot)$ 表示原假设下的概率。

  5. If the value of the test statistic $t_0$ falls in the critical region, then reject the null hypothesis.
    如果检验统计量 $t_0$ 的值落在拒绝域内,则拒绝原假设。

norminv(P,MU,SIGMA) 函数用来计算均值为 MU、标准差为 SIGMA 的正态分布的下侧 P 分位数。

norminv(P,MU,SIGMA,PCOV,ALPHA) produces confidence bounds for X when the input parameters MU and SIGMA are estimates.

>> norminv(0.95,0,1)
ans =
    1.6449

The probability of making a Type II error is represented by $\beta$, and it depends on the sample size, the significance level of the test, and the alternative hypothesis.
犯 II 型错误的概率用 $\beta$ 表示,它取决于样本量、检验的显著性水平以及备择假设。

The last part is important to remember: the probability that we will not detect a departure from the null hypothesis depends on the distribution of the test statistic under the alternative hypothesis.
最后一点很重要,要记住:我们不会检测到与原假设相背离的可能性,取决于替代假设下检验统计量的分布。

Recall that the alternative hypothesis allows for many different possibilities, yielding many distributions under $H_1$.
回想一下,备择假设包含许多不同的可能性,因而在 $H_1$ 下产生许多不同的分布。

So, we must determine the Type II error for every alternative hypothesis of interest.
因此,我们必须为每个感兴趣的替代假设确定II型误差。


A more convenient measure of the performance of a hypothesis test is the probability of not making a Type II error.
衡量假设检验性能的一个更方便的指标是不犯 II 型错误的概率。

This is called the power of a test.
这称为检验的功效。

We can consider this to be the probability of rejecting $H_0$ when it is really false.
我们可以把它理解为当 $H_0$ 确实为假时拒绝 $H_0$ 的概率。

Roughly speaking, one can think of the power as the ability of the hypothesis test to detect a false null hypothesis.
粗略地说,可以将功效视为假设检验检测虚假零假设的能力。
The power is given by
$$\text{Power} = 1-\beta \tag{7.2}$$
As we see in Example 7.3, the power of the test to detect departures from the null hypothesis depends on the true value of $\mu$.
如例7.3所示,检验发现偏离原假设的能力(功效)取决于 $\mu$ 的真实值。


例 7.3
$$H_0: \mu = 45, \qquad H_1: \mu>45$$
$$z_0 = \frac{\overline{x}-\mu_0}{\sigma_X / \sqrt{n}} = \frac{\overline{x}-\mu_0}{\sigma_{\overline{X}}}=\frac{47.2-45}{15/\sqrt{100}}=\frac{2.2}{1.5}=1.47$$

Returning to the transportation example, we illustrate the concepts of Type Ⅱ error and power.
回到运输的示例,我们来说明 II 型错误和功效的概念。

It is important to keep in mind that these values depend on the true mean $\mu$, so we have to calculate the Type II error for different values of $\mu$.
重要的是要记住,这些值取决于真实均值 $\mu$,因此我们必须针对 $\mu$ 的不同取值计算 II 型错误。

First we get a vector of values for $\mu$:
首先,我们得到 $\mu$ 的一组取值:

% Get several values for the mean under the alternative
% hypothesis. Note that we are getting some values
% below the null hypothesis.
mualt = 40:60;

It is actually easier to understand the power when we look at a test statistic based on $\overline{x}$ rather than $z_0$.
当我们使用基于 $\overline{x}$ 而不是 $z_0$ 的检验统计量时,实际上更容易理解功效。

So, we convert the critical value to its corresponding $\overline{x}$ value:
因此,我们将临界值转换为其相应的 $\overline{x}$ 值:

% Note the critical value:
cv = 1.645;
% cv = norminv(0.95,0,1)
% Note the standard deviation for x-bar:
sig = 1.5;
% It's easier to use the unstandardized version, 
% so convert:
ct = cv*1.5 + 45;

We find the area under the curve to the left of the critical value (the nonrejection region) for each of these values of the true mean.
对于真实均值的每一个取值,我们求临界值左侧(非拒绝域)曲线下的面积。

That would be the probability of not rejecting the null hypothesis.
那就是不拒绝原假设的可能性。

% Get a vector of ct values that is 
% the same size as mualt.
ctv = ct*ones(size(mualt));
% Now get the probabilities to the left of this value.
% These are the probabilities of the Type II error.
beta = normcdf(ctv,mualt,sig);

Note that the variable beta contains the probability of a Type II error (the area to the left of the critical value ctv under a normal curve with mean mualt and standard deviation sig) for every $\mu$.
请注意,对每一个 $\mu$,变量 beta 都给出了 II 型错误的概率(即均值为 mualt、标准差为 sig 的正态曲线下,临界值 ctv 左侧的面积)。

To get the power, simply subtract all of the values for beta from one.
要获得功效,只需计算 $1-\beta$ 即可。

% To get the power: 1-beta
pow = 1 - beta;

We plot the power against the true value of the population mean in Figure7.2.
在图7.2中,我们针对总体均值的真实值绘制了功效。

Note that as $\mu$ moves further above $\mu_0$, the power (or the likelihood that we can detect the alternative hypothesis) increases.
请注意,当 $\mu$ 越大于 $\mu_0$ 时,功效(即我们能够检测到备择假设的可能性)越大。

plot(mualt,pow);
xlabel('True Mean \mu')
ylabel('Power')
axis([40 60 0 1.1])

We leave it as an exercise for the reader to plot the probability of making a Type Ⅱ error.
我们将其留给读者练习来画出犯II型错误的概率。

$\Downarrow$

习题7.2 Using the information in Example 7.3, plot the probability of Type II error as a function of $\mu$. How does this compare with Figure 7.2?

mualt = 40:60;%在备择假设的条件下求得均值
cv = norminv(0.95,0,1);% 定义临界值
sig = 1.5;% 样本均值的标准差为1.5
ct = cv*1.5 + 45;% 将cv的值转化为样本服从的分布值
ctv = ct*ones(size(mualt));% 将ct建立成为一个向量,向量的空间大小与mualt相同
% 用错误的总体均值去算出第二类错误
beta = normcdf(ctv,mualt,sig);%二类错误与总体均值的函数
plot(mualt,beta);%画出函数图像
xlabel('True Mean \mu')
ylabel('Probability of Type II Error - \beta')
axis([40 60 0 1.1])

PROCEDURE-HYPOTHESIS TESTING (P-VALUE APPROACH)

  1. Determine the null and alternative hypotheses.
    确定原假设和备择假设。
  2. Find a test statistic $T$ that will provide evidence about $H_0$.
    找到能够提供关于 $H_0$ 的证据的检验统计量 $T$。
  3. Obtain a random sample from the population of interest and compute the value of the test statistic $t_0$ from the sample.
    从目标总体中获取随机样本,并从样本中计算检验统计量 $t_0$ 的值。
  4. Calculate the p-value:
    Lower Tail Test: p-value $= P_{H_0}(T\le t_0)$
    Upper Tail Test: p-value $= P_{H_0}(T\ge t_0)$
  5. If the p-value $\le \alpha$, then reject the null hypothesis.
    如果 p 值 $\le \alpha$,则拒绝原假设。

For a two-tail test, the p-value is determined similarly.
对于双尾检验,p值的确定方法类似。
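
下面是一个双尾 p 值计算的简单示意(假设检验统计量在 $H_0$ 下近似服从标准正态;zobs 取例 7.3 中算得的 1.47,仅作演示):

% 双尾检验的 p 值:两侧尾概率之和(示意)
zobs = 1.47;
pval2 = 2*(1 - normcdf(abs(zobs),0,1));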

例 7.4

In this example, we repeat the hypothesis test of Example 7.2 using the p-value approach.
在此示例中,我们使用p值方法重复示例7.2的假设检验。

% Example 7.2
cv = norminv(0.95,0,1);

First we set some of the values we need:
首先,我们设置一些我们需要的值:

mu = 45;									
sig = 1.5;
xbar = 47.2;
% Get the observed value of test statistic.
zobs = (xbar - mu)/sig;

The p-value is the area under the curve greater than the value for zobs.
p 值是曲线下大于 zobs 的那部分面积。

We can find it using the following command:
我们可以使用以下命令找到它:

pval = 1-normcdf(zobs,0,1);

We get a p-value of 0.071.
计算得出P值为0.071

If we are doing the hypothesis test at the 0.05 significance level, then we would not reject the null hypothesis.
如果我们在0.05显着性水平上进行假设检验,那么我们将不会拒绝原假设。

This is consistent with the results we had previously.
这与我们之前的结果一致。

Note that in each approach, knowledge of the distribution of $T$ under the null hypothesis $H_0$ is needed.
注意,在每种方法中,都需要知道原假设 $H_0$ 下 $T$ 的分布。

How to tackle situations where we do not know the distribution of our statistic is the focus of the rest of the chapter.
本章其余部分将重点介绍如何处理我们不知道统计信息分布的情况。

Monte Carlo Methods for Inferential Statistics

The sampling distribution is known for many statistics.
抽样分布对于许多统计数据都是众所周知的。

However, these are typically derived using assumptions about the underlying population under study or for large sample sizes.
但是,这些分布通常是在对所研究的总体作出假设或在大样本条件下推导出来的。

In many cases, we do not know the sampling distribution for the statistic, or we cannot be sure that the assumptions are satisfied.
在许多情况下,我们不知道该统计数据的抽样分布,或者我们无法确定是否满足这些假设。

We can address these cases using Monte Carlo simulation methods, which is the topic of this section.
我们可以使用蒙特卡洛模拟方法解决这些情况,这是本节的主题。

Some of the uses of Monte Carlo simulation for inferential statistics are the following:
蒙特卡洛模拟在推理统计中的一些用途如下:

  • Performing inference when the distribution of the test statistic is not known analytically,
    如果无法通过分析得知测试统计量的分布,则进行推断,
  • Assessing the performance of inferential methods when parametric assumptions are violated,
    在违反参数假设时评估推论方法的性能,
  • Testing the null and alternative hypotheses under various conditions,
    在各种条件下检验原假设和替代假设,
  • Evaluating the performance (e.g., power) of inferential methods, and
    评估推理方法的性能(例如功效)
  • Comparing the quality of estimators.
    比较估算器的质量。

In this section, we cover situations in inferential statistics where we do know something about the distribution of the population or we are willing to make assumptions about the distribution.
在本节中,我们介绍了推断统计中的一些情况,在这些情况中我们确实了解总体分布,或者愿意对分布进行假设。

In the next section, we discuss bootstrap methods that can be used when no assumptions are made about the underlying distribution of the population.
在下一节中,我们讨论了在不对总体的基础分布进行任何假设的情况下可以使用的bootstrap方法。

Basic Monte Carlo Procedure

The fundamental idea behind Monte Carlo simulation for inferential statistics is that insights regarding the characteristics of a statistic can be gained by repeatedly drawing random samples from the same population or distribution of interest and observing the behavior of the statistic over the samples.
蒙特卡洛推论统计背后的基本思想是,可以通过重复从相同的总体或感兴趣的分布中抽取随机样本,并观察样本上的统计行为,来获得有关统计特征的了解。

In other words, we estimate the distribution of the statistic by randomly sampling from the population or distribution and recording the value of the statistic for each sample.
换句话说,我们通过从总体或分布中随机抽样并记录每个样本的统计值来估计统计的分布。

The observed values of the statistic for these samples are used to estimate the distribution.
这些样本的统计观察值用于估计分布。

The first step is to decide on a pseudo-population or distribution that the analyst assumes represents the real population in all relevant aspects.
第一步是确定一个伪总体或分布,分析人员假定它在所有相关方面都能代表真实总体。

We use the word pseudo here to emphasize the fact that we obtain our samples using a computer and pseudo random numbers.
我们在这里使用“伪”一词来强调这一事实,即我们使用计算机和伪随机数获取样本。

For example, we might assume that the underlying population is exponentially distributed if the random variable represents the time before a part fails, or we could assume the random variable comes from a normal distribution if we are measuring IQ scores.
例如,如果随机变量表示零件失效之前的时间,则可以假定基础总体呈指数分布,或者,如果我们测量IQ分数,则可以假定随机变量来自正态分布。

In this text, we consider this type of Monte Carlo simulation to be a parametric technique because we sample from a known (or assumed)distribution.
在本文中,由于我们从已知(或假设的)分布中采样,因此我们将这种类型的蒙特卡洛模拟视为一种参数化技术。

The basic Monte Carlo procedure is outlined here.
此处概述了基本的蒙特卡洛程序。

Later, we provide procedures illustrating some specific uses of Monte Carlo simulation as applied to statistical hypothesis testing.
稍后,我们提供了一些程序,这些程序说明了应用于统计假设检验的蒙特卡洛模拟的某些特定用途。

PROCEDURE-BASIC MONTE CARLO SIMULATION

  1. Determine the pseudo-population or model that represents the true population of interest.
    确定能代表所关心的真实总体的伪总体或模型。
  2. Use a sampling procedure to sample from the pseudo-population or distribution.
    使用抽样程序从伪总体或分布中抽样。
  3. Calculate a value for the statistic of interest and store it.
    计算感兴趣的统计量的值并保存。
  4. Repeat steps 2 and 3 for M trials.
    重复步骤2和3,共 M 次试验。
  5. Use the M values found in step 4 to study the distribution of the statistic.
    用第4步得到的 M 个值研究统计量的分布。

It is important to keep in mind that when sampling from the pseudo-population, the analyst should ensure that all relevant characteristics reflect the statistical situation.
重要的是要记住,从伪总体抽样时,分析人员应确保所有相关特征都反映真实的统计情形。

For example, the same sample size and sampling strategy should be used when trying to understand the performance of a statistic using simulation.
例如,在用模拟研究某个统计量的表现时,应使用相同的样本量和抽样策略。

This means that the distribution for the statistic obtained via Monte Carlo simulation is relevant only for the conditions of the sampling procedure and the assumptions about the pseudo-population.
这意味着通过蒙特卡洛模拟得到的统计量分布,只在该抽样过程的条件和关于伪总体的假设下才有意义。

Note that in the last step of the Monte Carlo simulation procedure, the analyst can use the estimated distribution of the statistic to study characteristics of interest.
请注意,在蒙特卡洛模拟过程的最后一步中,分析人员可以使用统计量的估计分布来研究感兴趣的特征。

For example, one could use this information to estimate the skewness, bias, standard deviation, kurtosis, and many other characteristics.
例如,可以用这些信息来估计偏度、偏差、标准差、峰度以及许多其他特征。
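
下面给出一个按上述基本蒙特卡洛程序写的最小示意代码(非书中代码):假设伪总体为标准正态分布,感兴趣的统计量是样本中位数,样本量 n=25、试验次数 M=5000 都是示例取值。

% 示意:用基本蒙特卡洛模拟研究样本中位数的分布(假设伪总体为 N(0,1))
n = 25;                       % 每次试验的样本量
M = 5000;                     % 蒙特卡洛试验次数
meds = zeros(1,M);
for i = 1:M
    xs = randn(1,n);          % 步骤2:从伪总体中抽样
    meds(i) = median(xs);     % 步骤3:计算并保存统计量
end
% 步骤5:用 M 个值研究统计量的分布特征
bias_hat = mean(meds) - 0;    % 偏差的估计(N(0,1) 的真实中位数为 0)
se_hat   = std(meds);         % 标准差的估计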

Monte Carlo Hypothesis Testing

Recall that in statistical hypothesis testing, we have a test statistic that provides evidence that the null hypothesis should be rejected or not.
回想一下,在统计假设检验中,我们有一个检验统计量,它为是否应拒绝原假设提供证据。

Once we observe the value of the test statistic, we decide whether that particular value is consistent with the null hypothesis.
一旦我们观察到检验统计量的值,就可以确定该特定值是否与原假设一致。

To make that decision, we must know the distribution of the statistic when the null hypothesis is true.
要做出这个决定,我们必须知道原假设为真时检验统计量的分布。

Estimating the distribution of the test statistic under the null hypothesis is one of the goals of Monte Carlo hypothesis testing.
在原假设下估计检验统计量的分布是蒙特卡洛假设检验的目标之一。

We discuss and illustrate the Monte Carlo method as applied to the critical value and p-value approaches to hypothesis testing.
我们讨论并说明了应用于假设检验的临界值和p值方法的蒙特卡洛方法。

Recall that in the critical value approach to hypothesis testing, we are given a significance level $\alpha$.
回想一下,在假设检验的临界值方法中,我们给定了显著性水平 $\alpha$。

We then use this significance level to find the appropriate critical region in the distribution of the test statistic when the null hypothesis is true.
然后,我们利用该显著性水平,在原假设为真时检验统计量的分布中找到相应的临界区域(拒绝域)。

Using the Monte Carlo method, we determine the critical value using the estimated distribution of the test statistic.
使用蒙特卡洛方法,我们使用检验统计量的估计分布来确定临界值。

The basic procedure is to randomly sample many times from the pseudo-population representing the null hypothesis, calculate the value of the test statistic at each trial, and use these values to estimate the distribution of the test statistic.
基本程序是从代表原假设的伪分布中随机抽样多次,在每次试验中计算检验统计量的值,并使用这些值估算检验统计量的分布。

PROCEDURE- MONTE CARLO HYPOTHESIS TESTING(CRITICAL VALUE)

  1. Using an available random sample of size $n$ from the population of interest, calculate the observed value of the test statistic, $t_0$
    使用来自感兴趣总体的大小为 $n$ 的随机样本,计算检验统计量的观察值 $t_0$
  2. Decide on a pseudo-population that reflects the characteristics of the true population under the null hypothesis.
    确定一个能反映原假设下真实总体特征的伪总体。
  3. Obtain a random sample of size $n$ from the pseudo-population.
    从伪总体中获得大小为 $n$ 的随机样本。
  4. Calculate the value of the test statistic using the random sample in step 3 and record it.
    使用步骤3中的随机样本计算检验统计量的值,并将其记录下来。
  5. Repeat steps 3 and 4 for M trials.
    We now have values $t_1,t_2,...,t_M$, that serve as an estimate of the distribution of the test statistic, $T$, when the null hypothesis is true.
    重复步骤3和4,共 M 次。
    现在我们有了 $t_1,t_2,...,t_M$,当原假设为真时,它们可作为检验统计量 $T$ 分布的估计。
  6. Obtain the critical value for the given significance level $\alpha$:
    获得给定显著性水平 $\alpha$ 的临界值:
    Lower Tail Test: get the $\alpha$-th sample quantile, $\hat{q}_{\alpha}$, from the $t_1,t_2,...,t_M$.
    Upper Tail Test: get the $(1-\alpha)$-th sample quantile, $\hat{q}_{1-\alpha}$, from the $t_1,t_2,...,t_M$.
    Two-Tail Test: get the sample quantiles $\hat{q}_{\alpha/2}$ and $\hat{q}_{1-\alpha/2}$ from the $t_1,t_2,...,t_M$.
  7. If $t_0$ falls in the critical region, then reject the null hypothesis.
    如果 $t_0$ 落在临界区域(拒绝域)内,则拒绝原假设。

The critical values in step 6 can be obtained using the estimate of a sample quantile that we discussed in Chapter 3.
可以使用我们在第3章中讨论的样本分位数的估计值来获得步骤6中的临界值。

The function csquantiles from the Computational Statistics Toolbox is also available to find these values.
计算统计工具箱中的函数csquantiles也可以找到这些值。
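
如果手头没有 Computational Statistics Toolbox,也可以(假设装有 MATLAB 统计工具箱,这里只是一个替代做法)用内置的 quantile 函数得到近似的临界值,例如:

% 示意:用统计工具箱的 quantile 函数代替 csquantiles(下尾检验)
% Tm 为第5步得到的 M 个检验统计量的值,alpha 为显著性水平
cv = quantile(Tm, alpha);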

In the examples given next, we apply the Monte Carlo method to a familiar hypothesis testing situation where we are testing a hypothesis about the population mean.
在下面给出的示例中,我们将蒙特卡洛方法应用于熟悉的假设检验情况,在该情况下,我们正在检验关于总体均值的假设。

As we saw earlier, we can use analytical approaches for this type of test.
如我们先前所见,我们可以将分析方法用于这种类型的测试。

We use this simple application in the hope that the reader will better understand the ideas of Monte Carlo hypothesis testing and then easily apply them to more complicated problems.
我们使用这个简单的应用,希望读者能更好地理解蒙特卡洛假设检验的思想,然后轻松地将其应用于更复杂的问题。

Example 7.6

This toy example illustrates the concepts of Monte Carlo hypothesis testing.
这个玩具示例说明了蒙特卡洛假设检验的概念。

The mcdata data set contains 25 observations.

We are interested in using these data to test the following null and alternative hypotheses:
$$\begin{array}{ll} H_{0}: & \mu=454 \\ H_{1}: & \mu<454 \end{array}$$
We will perform our hypothesis test using simulation to get the critical values.
我们将使用模拟进行假设检验,以获取临界值。

We decide to use the following as our test statistic
$$z = \frac{\overline{x} - 454}{\sigma/\sqrt{n}}$$
First, we take care of some preliminaries.

addpath('C:\Users\Clancey\Documents\Nutstore\我的坚果云\研究生学习\课程笔记\3统计计算与优化\CSToolbox V3')
% Load up the data.
load mcdata
n = length(mcdata);  								
% Population sigma is known.
sigma = 7.8;					
sigxbar = sigma/sqrt(n);
% Get the observed value of the test statistic.
Tobs = (mean(mcdata)-454)/sigxbar;

The observed value of the test statistic is $t_0 = -2.56$.
检验统计量的观察值为 $t_0 = -2.56$。

The next step is to decide on a model for the population that generated our data.
下一步是为生成我们数据的总体确定模型。

We suspect that the normal distribution with $\sigma = 7.8$ is a good model, and we check this assumption using a normal probability plot.
我们猜测 $\sigma = 7.8$ 的正态分布是一个合适的模型,并用正态概率图来检查这个假设。

The resulting plot in Figure 7.4 shows that we can use the normal distribution as the pseudo-population.
图7.4中的结果表明,我们可以使用正态分布作为伪总体。

normplot(mcdata)

运行上面的命令,画出了正态分布和数据mcdata的QQ图

(图 7.4:mcdata 数据的正态概率图)
This normal probability plot for the mcdata data shows that assuming a normal distribution for the data is reasonable.
这个针对mcdata数据的正态概率图表明,假设数据为正态分布是合理的。

We are now ready to implement the Monte Carlo simulation.
我们现在准备实施蒙特卡罗模拟

We use 1000 trials in this example.
我们在这个例子中使用了1000次试验

At each trial, we randomly sample from the pseudo-population under the null hypothesis (the normal distribution with $\mu = 454$ and $\sigma = 7.8$) and record the value of the test statistic.
在每一次试验中,我们从原假设下的伪总体( $\mu = 454$、$\sigma = 7.8$ 的正态分布)中随机抽样,并记录检验统计量的值。

% Number of Monte Carlo trials
M = 1000;				
% Storage for test statistics from the MC trials.
Tm = zeros(1,M);
% Start the simulation.
for i = 1:M
			% Generate a random sample under H_0
			% where n is the sample size.
			xs = sigma*randn(1,n) + 454;  % randn(1,n)就是Monte Carlo 的精髓,体现的随机化
			Tm(i) = (mean(xs) - 454)/sigxbar;
end

Now that we have the estimated distribution of the test statistic contained in the variable Tm, we can use that to estimate the critical value for a lower tail test.
现在,变量 Tm 中已经存有检验统计量的估计分布,我们可以用它来估计下尾检验的临界值。

% Get the critical value for alpha.
% This is a lower-tail test, so it is the
% alpha quantile.
alpha = 0.05;
cv = csquantiles(Tm,alpha);

We get an estimated critical value of $-1.75$.

Since the observed value of our test statistic is $t_0 = -2.56$, which is less than the estimated critical value, we reject $H_0$.

The procedure for Monte Carlo hypothesis testing using the p-value approach is similar.
使用p值方法进行蒙特卡洛假设检验的过程相似。

Instead of finding the critical value from the simulated distribution of the test statistic, we use it to estimate the p-value.
我们不再从检验统计量的模拟分布中求临界值,而是用它来估计p值。

PROCEDURE- MONTE CARLO HYPOTHESIS TESTING(P-VALUE)

  1. For a random sample of size $n$ to be used in a statistical hypothesis test, calculate the observed value of the test statistic, $t_0$.
    对于要在统计假设检验中使用的大小为 $n$ 的随机样本,计算检验统计量的观测值 $t_0$。

  2. Decide on a pseudo-population that reflects the characteristics of the population under the null hypothesis.
    确定能反映原假设下总体特征的伪总体。

  3. Obtain a random sample of size $n$ from the pseudo-population.
    从伪总体中获得大小为 $n$ 的随机样本。

  4. Calculate the value of the test statistic using the random sample in step 3 and record it as $t_i$.
    使用步骤3中的随机样本计算检验统计量的值,并将其记录为 $t_i$。

  5. Repeat steps 3 and 4 for $M$ trials.
    重复步骤3和4,共 $M$ 次。
    We now have values $t_1,...,t_M$, that serve as an estimate of the distribution of the test statistic, $T$, when the null hypothesis is true.
    现在我们有了 $t_1,...,t_M$,当原假设为真时,它们可作为检验统计量 $T$ 分布的估计。

  6. Estimate the p-value using the distribution found in step 5, using the following.

    Lower Tail Test: $\hat{p}\text{-value} = \frac{\sharp(t_i \le t_0)}{M},\quad i = 1,2,...,M$

    Upper Tail Test: $\hat{p}\text{-value} = \frac{\sharp(t_i \ge t_0)}{M},\quad i = 1,2,...,M$

  7. If $\hat{p}$-value $\le \alpha$, then reject the null hypothesis.

Example 7.7

We return to the situation in Example 7.6 and apply Monte Carlo simulation to the p-value approach to hypothesis testing.
我们返回示例7.6中的情况,并将蒙特卡洛模拟应用于p值方法进行假设检验。

Just to change things a bit, we use the sample mean as our test statistic.
只是改变一点,我们使用样本均值作为检验统计量。

% Let's change the test statistic to xbar.
Tobs = mean(mcdata);
% Number of Monte Carlo trials.
M = 1000;				
% Start the simulation.
Tm = zeros(1,M);
for i = 1:M
			% Generate a random sample under H_0.
			xs = sigma*randn(1,n) + 454;
			Tm(i) = mean(xs);
end

We find the estimated p-value by counting the number of observations in Tm that are below the observed value of the test statistic and dividing by M.
我们统计 Tm 中不超过检验统计量观察值的个数,再除以 M,得到估计的p值。

% Get the p-value. This is a lower-tail test.
% Find all of the values from the simulation that are 
% below the observed value of the test statistic.
ind = find(Tm <= Tobs);
pvalhat = length(ind)/M;  
% 由于随机因素的存在,这里的程序计算出来的数值不一定是0.007
% 但是接近

We have an estimated p-value given by $0.007$.

If the significance level of our test is $\alpha = 0.05$, then we would reject the null hypothesis.
如果检验的显著性水平为 $\alpha = 0.05$,则我们将拒绝原假设。

Monte Carlo Assessment of Hypothesis Testing

用蒙特卡洛估计两类错误 Type Ⅰ error and Type Ⅱ error

Monte Carlo simulation can be used to evaluate the performance of an inference model or hypothesis test in terms of the Type Ⅰ error and the Type Ⅱ error.
蒙特卡洛模拟可用于根据Ⅰ型误差和Ⅱ型误差评估推理模型或假设检验的性能。

For some statistics, such as the sample mean, these errors can be determined analytically.
对于某些统计量(例如样本均值),这些误差可以解析地确定。

However, what if we have an inference test where the assumptions of the standard methods might be violated or the analytical methods cannot be applied?
但是,如果我们进行了推理测试,可能会违反标准方法的假设或无法应用分析方法,该怎么办?

For instance, suppose we choose the critical value by using a normal approximation (when our test statistic is not normally distributed), and we need to assess the consequences of doing that.
例如,假设我们在检验统计量并非正态分布时用正态近似来选取临界值,并且需要评估这样做的后果。

In these situations, we can use Monte Carlo simulation to estimate the Type Ⅰ and the Type Ⅱ error.
在这种情况下,我们可以使用蒙特卡洛模拟来估计Ⅰ型和Ⅱ型误差。

We first outline the procedure for estimating the Type Ⅰ error.
我们首先概述估计Ⅰ型误差的过程。

Because the Type Ⅰ error occurs when we reject the null hypothesis when it is true, we must sample from the pseudo-population that represents $H_0$.
因为Ⅰ类错误发生在原假设为真却被我们拒绝时,所以我们必须从代表 $H_0$ 的伪总体中抽样。

PROCEDURE- MONTE CARLO ASSESSMENT OF TYPE Ⅰ ERROR

  1. Determine the pseudo-population or distribution when the null hypothesis is true.
    当原假设为真时,确定伪总体或分布。

  2. Generate a random sample of size n from this pseudo-population.
    从该伪总体生成大小为n的随机样本。

  3. Perform the hypothesis test using the critical value.
    使用临界值执行假设检验。

  4. Determine whether a Type Ⅰ error has been committed.
    确定是否发生了Ⅰ类错误。
    In other words, was the null hypothesis rejected?
    换句话说,原假设是否被拒绝?
    We know it should not be rejected because we are sampling from the distribution according to the null hypothesis.
    我们知道它不应该被拒绝,因为我们是按照原假设对应的分布进行抽样的。
    Record the result for this trial as
    将该试验的结果记录为
    $$I_{i}=\left\{\begin{array}{ll} 1 ; & \text{ Type I error is made } \\ 0 ; & \text{ Type I error is not made. } \end{array}\right.$$

  5. Repeat steps 2 through 4 for $M$ trials.
    重复步骤2到4,共 $M$ 次。

  6. The probability of making a Type Ⅰ error is
    发生Ⅰ类错误的概率为
    $$\hat{\alpha} = \frac{1}{M} \sum_{i=1}^{M} I_i \tag{7.9}$$

Note that in step 6, this is the same as calculating the proportion of times the null hypothesis is falsely rejected out of $M$ trials.
请注意,在步骤6中,这等同于计算在 $M$ 次试验中原假设被错误拒绝的比例。

This provides an estimate of the significance level of the test for a given critical value.
对于给定的临界值,这提供了对检验显著性水平的估计。

The procedure is similar for estimating the Type Ⅱ error of a hypothesis test.
估计假设检验的Ⅱ类错误的过程与之类似。

However, this error is determined by sampling from the distribution when the null hypothesis is false.
但是,当原假设为假时,通过从分布中采样来确定此错误。

There are many possibilities for the Type Ⅱ error, and the analyst should investigate the Type Ⅱ error for those alternative hypotheses that are of interest.
Ⅱ型误差有很多可能性,分析人员应针对感兴趣的备择假设研究Ⅱ型误差。

PROCEDURE- MONTE CARLO ASSESSMENT OF TYPE Ⅱ ERROR

  1. Determine the pseudo-population or distribution when the null hypothesis is false.
    当原假设为假时,确定伪总体或分布。

  2. Generate a random sample of size n from this pseudo-population.
    从该伪总体生成大小为n的随机样本。

  3. Perform the hypothesis test using the significance level $\alpha$ and corresponding critical value.
    使用显著性水平 $\alpha$ 和相应的临界值执行假设检验。
    这一点和下面的步骤都和第一类错误的情形不一样

  4. Note whether a Type Ⅱ error has been committed;
    注意是否发生了Ⅱ型错误。
    i.e., was the null hypothesis not rejected?
    就是说,原假设不被拒绝?
    Record the result for this trial as
    将该试验的结果记录为
    $$I_{i}=\left\{\begin{array}{ll} 1 ; & \text{ Type Ⅱ error is made } \\ 0 ; & \text{ Type Ⅱ error is not made. } \end{array}\right.$$

  5. Repeat steps 2 through 4 for $M$ trials.
    重复步骤2到4,共 $M$ 次。

  6. The probability of making a Type Ⅱ error is
    发生Ⅱ类错误的概率为
    $$\hat{\beta} = \frac{1}{M} \sum_{i=1}^{M} I_i \tag{7.10}$$

The Type Ⅱ error rate is estimated using the proportion of times the null hypothesis is not rejected (when it should be) out of $M$ trials.
Ⅱ类错误率用 $M$ 次试验中本应拒绝原假设却未拒绝的次数比例来估计。

Example 7.8

For the hypothesis test in Example 7.6, we had a critical value (from theory) of $-1.645$.
对于示例7.6中的假设检验,我们有(由理论得到的)临界值 $-1.645$。

We can estimate the significance level of the test using the following steps:
我们可以使用以下步骤来估计检验的显著性水平:

addpath('C:\Users\Clancey\Documents\Nutstore\我的坚果云\研究生学习\课程笔记\3统计计算与优化\CSToolbox V3')
% Load up the data.
load mcdata
n = length(mcdata);  
M = 1000;
alpha = 0.05;
sigma = 7.8;
sigxbar = sigma/sqrt(n);

% Get the critical value, using z as test statistic.
cv = norminv(alpha,0,1);
% Start the simulation.
Im = 0;

for i = 1:M
    % Generate a random sample under H_0.
    xs = sigma*randn(1,n) + 454;
    Tm = (mean(xs)-454)/sigxbar;
    if Tm <= cv % then reject H_0
        Im = Im +1;
    end
end
alphahat = Im/M;

A critical value of -1.645 in this situation corresponds to a desired probability of Type Ⅰ error of 0.05.
在这种情况下,临界值 -1.645 对应于期望的Ⅰ型错误概率 0.05。

From this simulation, we get an estimated value of 0.045, which is very close to the theoretical value.
从该模拟中,我们获得了0.045的估计值,该值非常接近理论值。
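
顺带一提,alphahat 本质上是一个二项比例的估计,其蒙特卡洛标准误可以用标准的比例标准误公式粗略衡量(这是补充说明,不是书中内容):

% 示意:估计 alphahat 的蒙特卡洛标准误(二项比例的标准误)
se_alphahat = sqrt(alphahat*(1-alphahat)/M);

在 M = 1000、alphahat ≈ 0.045 时,该标准误约为 0.0066,说明 0.045 与 0.05 的差异在模拟误差范围之内。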

We now check the Type Ⅱ error in this test.
现在,我们在此测试中检查Ⅱ型错误。

Note that we now have to sample from the alternative hypotheses of interest.
注意,我们现在必须从感兴趣的备择假设中抽样。

% Now check the probability of Type II error.
% Get some alternative hypotheses:
mualt = 445:458;
betahat = zeros(size(mualt));
for j = 1:length(mualt)
    Im = 0;
    % Get the true mean.
    mu = mualt(j);
    for i = 1:M
        % Generate a sample from H_1.
        xs = sigma*randn(1,n) + mu;
        Tm = (mean(xs)-454)/sigxbar;
        if Tm > cv % Then did not reject H_0.
            Im = Im +1;
        end
    end
    betahat(j) = Im/M;
end
% Get the estimated power.
powhat = 1-betahat;

We plot the estimated power as a function of $\mu$ in Figure 7.5.
我们在图7.5中画出了估计功效随 $\mu$ 变化的曲线。

As expected, as the true value for $\mu$ gets closer to 454 (the mean under the null hypothesis), the power of the test decreases.
正如预期的那样,随着 $\mu$ 的真实值越来越接近454(原假设下的均值),检验的功效下降。

% Plot to get Figure 7.5
plot(mualt,powhat,'o-')
ylabel('Estimated Power')
xlabel('\mu')
(图 7.5:估计功效随 μ 的变化)
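
对于这个 σ 已知的下尾 z 检验,理论功效有解析表达式,可以把它叠加到上面的图里与模拟结果对比(以下为示意代码,沿用上面的 cv、sigxbar、mualt):

% 示意:理论功效曲线 power(mu) = Phi(cv + (454-mu)/sigxbar)
powtrue = normcdf(cv + (454 - mualt)/sigxbar);
hold on
plot(mualt, powtrue, 'k--')
legend('估计功效','理论功效')
hold off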

In other cases, we might need to know whether the distribution of the statistic changes with sample size or skewness in the population or some other characteristic of interest.
在其他情况下,我们可能需要知道统计的分布是否随样本量或总体中的偏度或感兴趣的其他特征而变化。

These variations are easily investigated using multiple Monte Carlo experiments.
使用多个蒙特卡洛实验可以轻松地研究这些变化。

One quantity that the researcher must determine is the number of trials that are needed in Monte Carlo simulations.
研究人员必须确定的一个数量是蒙特卡洛模拟中所需的试验次数。

This often depends on the computing assets that are available.
这通常取决于可用的计算能力。

If time and computer resources are not an issue, then $M$ should be made as large as possible.
如果时间和计算机资源不是问题,则应将 $M$ 设置得尽可能大。

Hope [1968] showed that results from a Monte Carlo simulation are unbiased for any M, under the assumption that the programming is correct.
Hope(1968)指出,在编程正确的前提下,对于任何M,蒙特卡罗模拟的结果都是无偏的。

Mooney [1997] states that there is no general theory that governs the number of trials in Monte Carlo simulation.
Mooney [1997]指出,没有通用理论来控制蒙特卡洛模拟中的试验次数。

However, he recommends the following general guidelines.
但是,他建议以下一般准则。

The researcher should first use a small number of trials and ensure that the program is working properly.
研究人员应首先使用少量试验,并确保程序正常运行。

Once the code has been checked, the simulation or experiments can be run for very large M.
一旦检查了代码,就可以对非常大的M运行仿真或实验。

Most simulations would have $M > 1000$, but M between 10,000 and 25,000 is not uncommon.
大多数模拟的 $M > 1000$,但是 M 在 10,000 和 25,000 之间的情况并不罕见。

One important guideline for determining the number of trials is the purpose of the simulation.
确定试验次数的一项重要指南是模拟的目的。

If the tail of the distribution is of interest(e.g., estimating Type Ⅰ error, getting p-values, etc.), then more trials are needed to ensure that there will be a good estimate of that area.
如果感兴趣的是分布的尾部(例如,估计Ⅰ型误差,获得p值等),则需要进行更多的试验以确保对该区域有一个良好的估计。
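
一个粗略的定量参考(标准的二项比例方差结果,非书中内容):估计一个大小约为 p 的尾部概率时,蒙特卡洛标准误约为 $\sqrt{p(1-p)/M}$,据此可以反推所需的试验次数,例如:

% 示意:按希望达到的标准误 se 反推试验次数 M(p、se 均为示例取值)
p  = 0.05;
se = 0.005;
M_needed = ceil(p*(1-p)/se^2)   % 约需 1900 次试验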

Chapter 8 Data Partitioning

Chapter 9 Probability Density Estimation

Chapter 10 Supervised Learning

Chapter 11 Unsupervised Learning

Chapter 12 Parametric Models

Chapter 13 Nonparametric Models

Chapter 14 Markov Chain Monte Carlo Methods

p613

Introduction

Background

Bayesian Inference

略,给出几个公式
$$P(D,\theta) = P(\theta) P(D|\theta)$$

$$P(\theta|D) = \frac{P(\theta) P(D|\theta)}{\int P(\theta) P(D|\theta)\, d\theta} \tag{14.1}$$

$$P(\theta|D) \propto P(\theta) P(D|\theta) = P(\theta) L(\theta;D)$$

$$E[f(\theta)|D] = \frac{ \int f(\theta)P(\theta) P(D|\theta)\, d\theta }{ \int P(\theta) P(D|\theta)\, d\theta } \tag{14.2}$$

$$E[f(X)] = \frac{\int f(x) \pi(x)\, dx}{\int \pi(x)\, dx} \tag{14.3}$$

Monte Carlo Integration

Monte Carlo integration estimates the integral $E[f(X)]$ of Equation 14.3 by obtaining samples $X_t,\ t = 1,..., n$ from the distribution $\pi(x)$ and calculating
蒙特卡洛积分通过从分布 $\pi(x)$ 中获得样本 $X_t,\ t = 1,...,n$ 并计算下式来估计公式14.3的积分 $E[f(X)]$:
$$E[f(X)] \approx \frac{1}{n} \sum_{t=1}^{n} f(X_t) \tag{14.4}$$
The notation $t$ is used here because there is an ordering or sequence to the random variables in MCMC methods.
这里使用符号 $t$,是因为MCMC方法中的随机变量具有次序或序列结构。

We know that when the $X_t$ are independent, then the approximation can be made as accurate as needed by increasing $n$.
我们知道,当 $X_t$ 相互独立时,通过增大 $n$ 可以使近似达到所需的精度。

We will see in the following sections that with MCMC methods,the samples are not independent in most cases.
在以下各节中,我们将看到使用MCMC方法时,在大多数情况下样本并不是独立的。

That does not limit their use in finding integrals using approximations such as Equation 14.4.
这并不妨碍用诸如公式14.4的近似来求积分。

However,care must be taken when determining the variance of the estimate in Equation 14.4 because of dependence [Gentle, 1998; Robert and Casella,1999].
但是,由于依赖关系,在公式14.4中确定估计值的方差时必须小心[Gentle,1998; Robert and Casella,1999]。

We illustrate the method of Monte Carlo integration in the next example.
在下一个示例中,我们将说明Monte Carlo积分的方法。

Example 14.1 假设知道了分布,怎么去计算期望?

For a distribution that is exponential with $\lambda = 1$, we find $E[\sqrt{X}]$ using Equation 14.4.
对于 $\lambda = 1$ 的指数分布,我们使用公式14.4求 $E[\sqrt{X}]$。

We generate random variables from the required distribution, take the square root of each one and then find the average of these values.
我们根据所需的分布生成随机变量,取每个变量的平方根,然后求出这些值的平均值。

This is implemented next in MATLAB.
接下来在MATLAB中实现。

% Generate 1000 exponential random 
% variables with lambda = 1.
% This is a Statistics Toolbox function.
% 指数分布抽样,第一个1是lambda,第二个1是行数,第三个1000是列数
x = exprnd(1,1,1000);								
% Take square root of each one.
xroot = sqrt(x);
% Take the mean - Equation 14.4
% 用matlab的mean()函数来计算数值解,精度较低
exroothat = mean(xroot);   

From this, we get an estimate of 0.889.

We can use MATLAB to find the value using numerical integration, as shown here.
我们可以使用MATLAB通过数值积分查找值,如下所示。

% Now get it using numerical integration
% 因为中间不参与运算,所以用字符型函数来跟踪
strg = 'sqrt(x).*exp(-x)';  % 字符型算法
myfun = inline(strg);   % 用inline变为函数
% quadl is a MATLAB 6 function.
exroottru = quadl(myfun,0,50);  % 0到50内积分

From this, we get a value of 0.886, which is close to the Monte Carlo method.
由此得出的值是0.886,与蒙特卡洛方法很接近。

注:inline()函数在以后的MATLAB中会删掉,推荐使用匿名函数来实现,匿名函数的定义和Python中类似,如下:

%  匿名函数
myfun1 = @(x) sqrt(x).*exp(-x);
xroottru_new = integral(myfun1,0,50);

The samples $X$ do not have to be independent as long as they are generated using a process that obtains samples from the “entire” domain of $\pi(x)$ and in the correct proportions [Gilks, et al., 1996a].
样本 $X$ 不必独立,只要它们是通过一个能以正确比例从 $\pi(x)$ 的“整个”定义域中获取样本的过程生成的即可 [Gilks, et al., 1996a]。

This can be done by constructing a Markov chain that has $\pi(x)$ as its stationary distribution.
这可以通过构造一个以 $\pi(x)$ 为平稳分布的马尔可夫链来实现。

We now give a brief description of Markov chains.
现在,我们对马尔可夫链进行简要描述。


Markov Chains

A Markov chain is a sequence of random variables such that the next value or state of the sequence depends only on the previous one.
马尔可夫链是一系列随机变量,此序列的下一个值或状态仅取决于前一个。

Thus, we are generating a sequence of random variables, $X_0,X_1,...$ such that the next state $X_{t+1}$, with $t>0$, is distributed according to $P(X_{t+1}|X_t)$, which is called the transition kernel.
因此,我们生成一列随机变量 $X_0,X_1,...$,使得下一个状态 $X_{t+1}$($t>0$)的分布由 $P(X_{t+1}|X_t)$ 给出,后者称为转移核。

A realization of this sequence is also called a Markov chain.
此序列的实现也称为马尔可夫链。

We assume that the transition kernel does not depend on t, making the chain time-homogeneous.
我们假设转移核不依赖于 t,从而使链是时间齐次的。


One issue that must be addressed is how sensitive the chain is to the starting state $X_0$.
必须解决的一个问题是链对初始状态 $X_0$ 的敏感程度。

Given certain conditions [Robert and Casella, 1999], the chain will forget its initial state and will converge to a stationary distribution, which is denoted by $\Psi$.
在某些条件下[Robert and Casella, 1999],链将忘记其初始状态,并收敛到一个平稳分布,用 $\Psi$ 表示。(Markov Chain 的无记忆性)

As the sequence grows larger, the sample points $X_t$ become dependent samples from $\Psi$.
随着序列变长,样本点 $X_t$ 成为来自 $\Psi$ 的相依样本。如果 Markov Chain 收敛到了平稳分布,则之后各时刻的随机变量都服从同一个分布

The reader interested in knowing the conditions under which this happens and for associated proofs of convergence to the stationary distribution is urged to read the references given in Section 14.7.
有兴趣了解这种情况发生的条件以及对平稳分布收敛的相关证明的读者,请阅读第14.7节中给出的参考。


Say the chain has been run for m iterations, and we can assume that the sample points $X_t,\ t=m+1,...,n$ are distributed according to the stationary distribution $\Psi$.
假设链已经运行了 m 次迭代,我们可以认为样本点 $X_t,\ t=m+1,...,n$ 服从平稳分布 $\Psi$。(假定 Markov Chain 收敛到了平稳分布)

We can discard the first $m$ iterations and use the remaining $n-m$ samples along with Equation 14.4 to get an estimate of the expectation as follows
我们可以舍弃最初的 $m$ 次迭代,并使用其余的 $n-m$ 个样本结合公式14.4来得到期望的估计,如下所示(假定 Markov Chain 已收敛到平稳分布,则可以扔掉前 m 次,只用后面平稳分布下的样本来算期望)
$$E[f(X)] \approx \frac{1}{n-m} \sum_{t=m+1}^{n} f(X_t) \tag{14.5}$$
The number of samples m that are discarded is called the burn-in.
丢弃的样本数量 $m$ 被称为燃烧期 —> 李航《统计学习方法》

The size of the burn-in period is the subject of current research in MCMC methods.
燃烧期的大小是当前MCMC方法研究的主题。

Diagnostic methods to help determine $m$ and $n$ are described in Section 14.5.
第14.5节介绍了有助于确定 $m$ 和 $n$ 的诊断方法。

Geyer [1992] suggests that the burn-in can be between 1% and 2% of $n$, where $n$ is large enough to obtain adequate precision in the estimate given by Equation 14.5.
Geyer [1992] 认为燃烧期可以取 $n$ 的1%到2%,其中 $n$ 要足够大,以便公式14.5给出的估计能达到足够的精度。
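
下面用一条简单的(玩具)马尔可夫链演示公式14.5的用法:示例中的链是平稳分布为 N(0,1) 的 AR(1) 链,取 f(x)=x^2,燃烧期按 Geyer 的建议取 n 的约2%(均为示意,非书中代码)。

% 示意:对一条平稳分布为 N(0,1) 的 AR(1) 链用公式 14.5 估计 E[X^2]
n = 20000;  rho = 0.8;
x = zeros(1,n);                 % x(1) = 0 作为初始状态
for t = 2:n
    x(t) = rho*x(t-1) + sqrt(1-rho^2)*randn(1);   % 转移核 P(X_{t+1}|X_t)
end
m = ceil(0.02*n);               % 燃烧期约为 n 的 2%
est = mean(x(m+1:n).^2);        % 公式 14.5,应接近 E[X^2] = 1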


So now we must answer the question: how large should $n$ be to get the required precision in the estimate?
因此,现在我们必须回答:$n$ 应该多大才能使估计达到所需的精度?

As stated previously, estimating the variance of the estimate given by Equation 14.5 is difficult because the samples are not independent.
如前所述,由于样本不是独立的,因此难以估计由公式14.5给出的估计的方差。

One way to determine $n$ via simulation is to run several Markov chains in parallel, each with a different starting value.
通过模拟确定 $n$ 的一种方法是并行运行多条马尔可夫链,每条链使用不同的初始值。

The estimates from Equation 14.5 are compared, and if the variation between them is too great, then the length of the chains should be increased [Gilks, et al., 1996b].
比较由公式14.5得到的各个估计,如果它们之间的差异太大,则应增加链的长度[Gilks, et al., 1996b]。

Other methods are given in Roberts [1996], Raftery and Lewis [1996], and in the general references mentioned in Section 14.7.
其他方法见 Roberts [1996]、Raftery 和 Lewis [1996],以及第14.7节中提到的一般参考文献。

Analyzing the output

输出分析:看经过 $m$ 步之后是否平稳


We now discuss how the output from the Markov chains can be used in statistical analysis.
现在我们讨论如何将马尔可夫链的输出用于统计分析。

An analyst might be interested in calculating means, standard deviations, correlations, and marginal distributions for components of $X$.
分析人员可能会对计算 $X$ 各分量的均值、标准差、相关性和边际分布感兴趣。

If we let $X_{t,j}$ represent the $j$-th component of $X_t$ at the $t$-th step in the chain, then using Equation 14.5, we can obtain the marginal means and variances from
如果用 $X_{t,j}$ 表示链中第 $t$ 步的 $X_t$ 的第 $j$ 个分量,则利用公式14.5,我们可以由下式得到边际均值和方差:
$$\overline{X}_{,j} =\frac{1}{n-m} \sum_{t=m+1}^{n} X_{t,j}$$
and
$$S^2_{,j} = \frac{1}{n-m+1} \sum_{t=m+1}^{n} (X_{t,j} - \overline{X}_{,j})^2$$
These estimates are simply the componentwise sample mean and sample variance of the sample points $X_t,\ t=m+1,...,n$.
这些估计值就是样本点 $X_t,\ t=m+1,...,n$ 按分量计算的样本均值和样本方差。

Sample correlations are obtained similarly.
类似地获得样本相关性。

Estimates of the marginal distributions can be obtained using the techniques of Chapter 8.
边际分布的估计值可以使用第8章的技术获得。
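
作为示意(非书中代码),若把链的输出存成 n×d 的矩阵 X(这里用随机数占位,实际应为 MCMC 链的输出),丢弃燃烧期后按列计算即可得到各分量的边际均值和方差:

% 示意:由链的输出计算各分量的边际均值与方差
n = 5000;  m = 500;  d = 2;
X = randn(n,d);                 % 占位:实际中应为 MCMC 链的输出
Xkeep  = X(m+1:n,:);            % 丢弃前 m 个燃烧期样本
xbar_j = mean(Xkeep);           % 各分量的边际均值
s2_j   = var(Xkeep);            % 各分量的边际方差(样本方差)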


One last problem we must deal with to make Markov chains useful is the stationary distribution $\Psi$.
为了使马尔可夫链有用,我们必须处理的最后一个问题是平稳分布 $\Psi$。

We need the ability to construct chains such that the stationary distribution of the chain is the one we are interested in: $\pi(x)$.
我们需要能够构造这样的链:其平稳分布正是我们感兴趣的分布 $\pi(x)$。

In the MCMC literature, $\pi(x)$ is often referred to as the target distribution.
在MCMC文献中,$\pi(x)$ 通常被称为目标分布。

It turns out that this is not difficult and is the subject of the next two sections.
事实证明,这并不困难,并且是接下来两节的主题。

Metropolis-Hasting Algorithms

The Metropolis-Hastings method is a generalization of the Metropolis technique of Metropolis, et al. [1953], which had been used for many years in the physics community.
Metropolis-Hastings 方法是 Metropolis 等人[1953]的 Metropolis 技术的推广,后者已在物理学界使用多年。

The paper by Hastings [1970] further generalized the technique in the context of statistics.
Hastings [1970]的论文在统计学的背景下进一步推广了该技术。

The Metropolis sampler, the independence sampler and the random-walk are all special cases of the Metropolis-Hastings method.
Metropolis采样器,独立采样器和随机游走都是Metropolis-Hastings方法的特例。

Thus, we cover the general method first,followed by the special cases.
因此,我们先介绍一般方法,然后再介绍特殊情况。

These methods share several properties, but one of the more useful properties is that they can be used in applications where $\pi(x)$ is known up to the constant of proportionality.
这些方法有若干共同性质,其中更有用的一个是:即使 $\pi(x)$ 只已知到相差一个比例常数,它们也能使用。

Another property that makes them useful in a lot of applications is that the analyst does not have to know the conditional distributions, which is the case with the Gibbs sampler.
另一个使它们在许多应用中很有用的性质是:分析人员不必像使用 Gibbs 采样器那样知道(完整的)条件分布。

While it can be shown that the Gibbs sampler is a special case of the Metropolis-Hastings algorithm [Robert and Casella, 1999], we include it in the next section because of this difference.
虽然可以证明Gibbs采样器是Metropolis-Hastings算法的特例[Robert and Casella,1999],但由于存在这种差异,我们将其包含在下一部分中。

Metropolis-Hastings Sampler

Metropolis-Hastings采样器


The Metropolis-Hastings sampler obtains the state of the chain at $t+1$ by sampling a candidate point $\Upsilon$ from a proposal distribution $q(.|X_t)$.
Metropolis-Hastings 采样器通过从提议分布 $q(.|X_t)$ 中抽取候选点 $\Upsilon$ 来获得链在 $t+1$ 时刻的状态。

Note that this depends only on the previous state $X_t$, and can have any form, subject to regularity conditions [Roberts, 1996].
注意,它只依赖于前一个状态 $X_t$,并且在满足正则性条件的前提下可以取任意形式[Roberts, 1996]。

An example for $q(.|X_t)$ is the multivariate normal with mean $X_t$ and fixed covariance matrix.
$q(.|X_t)$ 的一个例子是均值为 $X_t$、协方差矩阵固定的多元正态分布。

One thing to keep in mind when selecting $q(.|X_t)$ is that the proposal distribution should be easy to sample from.
选择 $q(.|X_t)$ 时要记住的一点是,提议分布应该易于抽样。


The required regularity conditions for $q(.|X_t)$ are irreducibility and aperiodicity [Chib and Greenberg, 1995].
$q(.|X_t)$ 所需的正则性条件是不可约性和非周期性[Chib and Greenberg, 1995]。

Irreducibility means that there is a positive probability that the Markov chain can reach any nonempty set from all starting points.
不可约性意味着马尔可夫链从任意起点出发,都能以正概率到达任何非空集合。

Aperiodicity ensures that the chain will not oscillate between different sets of states.
非周期性确保链不会在不同状态集之间振荡。

These conditions are usually satisfied if the proposal distribution has a positive density on the same support as the target distribution.
如果提议分布在与目标分布相同的支持下具有正密度,则通常可以满足这些条件。

They can also be satisfied when the target distribution has a restricted support.
当目标分布的支集受限时,这些条件也可以得到满足。

For example, one could use a uniform distribution around the current point in the chain.
例如,可以使用链中当前点周围的均匀分布。
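
例如,以当前点为中心的均匀提议分布可以这样写(示意代码,x_t、delta 均为假设的示例取值):

% 示意:以当前点 x_t 为中心、宽度为 2*delta 的均匀提议分布
x_t   = 0;                          % 假设的当前状态
delta = 1;                          % 假设的步长
y = x_t + delta*(2*rand(1) - 1);    % 在 [x_t-delta, x_t+delta] 上均匀抽样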

The candidate point is accepted as the next state of the chain with probability given by
候选点被接受为链的下一个状态,概率为
$$\alpha(X_t,\Upsilon) = \min \left\{1, \frac{\pi(\Upsilon) q(X_t|\Upsilon)}{\pi(X_t) q(\Upsilon|X_t)}\right\} \tag{14.6}$$
If the point $\Upsilon$ is not accepted, then the chain does not move and $X_{t+1}=X_t$.
如果候选点 $\Upsilon$ 未被接受,则链不移动,且 $X_{t+1}=X_t$。

The steps of the algorithm are outlined next.
接下来概述该算法的步骤。

It is important to note that the distribution of interest $\pi(x)$ appears as a ratio, so the constant of proportionality cancels out.
需要注意,感兴趣的分布 $\pi(x)$ 是以比值的形式出现的,因此比例常数会被约掉。

This is one of the appealing characteristics of the Metropolis-Hastings sampler, making it appropriate for a wide variety of applications.
这是Metropolis-Hastings采样器的吸引人的特征之一,使其适用于多种应用。

PROCEDURE - METROPOLIS-HASTINEGS SAMPLER

  1. Initialize the chain to $x_0$ and set $t=0$.
    将链初始化为 $x_0$,并设置 $t=0$。

  2. Generate a candidate point $\Upsilon$ from $q(.|X_t)$.
    从 $q(.|X_t)$ 生成候选点 $\Upsilon$。

  3. Generate $U$ from a uniform $(0,1)$ distribution.
    从均匀分布 $(0,1)$ 生成 $U$。

  4. If $U \le \alpha(X_t,\Upsilon)$, then set $X_{t+1}=\Upsilon$

    如果 $U \le \alpha(X_t,\Upsilon)$,则设置 $X_{t+1}=\Upsilon$;
    else set $X_{t+1}=X_t$.
    否则设置 $X_{t+1}=X_t$。

  5. Set $t=t+1$ and repeat steps 2 through 5.
    设置 $t = t + 1$ 并重复步骤2至5。

The Metropolis-Hastings procedure is implemented in Example 14.2, where we use it to generate random variables from a standard Cauchy distribution.
Metropolis-Hastings过程在示例14.2中实现,在此过程中,我们使用该过程从标准柯西分布中生成随机变量。

As we will see, this implementation is one of the special cases of the Metropolis-Hastings sampler described later.
我们将看到,此实现是稍后描述的Metropolis-Hastings采样器的特殊情况之一。

Example 14.2

We show how the Metropolis-Hastings sampler can be used to generate random variables from a standard Cauchy distribution given by
我们展示如何使用 Metropolis-Hastings 采样器从以下标准柯西分布中生成随机变量
$$f(x) = \frac{1}{\pi(1+x^2)} \qquad -\infty < x < \infty$$
From this, we see that
$$f(x) \propto \frac{1}{1+x^2}$$
We will use the normal as our proposal distribution, with a standard deviation given by $\sigma$.
我们将使用正态分布作为提议分布,其标准差为 $\sigma$(均值取链中的前一个值)。

We start by setting up inline MATLAB functions to evaluate the densities for Equation 14.6.
我们首先设置inline MATLAB函数以评估方程14.6的密度。

% Set up an inline function to evaluate the Cauchy.
% Note that in both of the functions, 
% the constants are canceled.
strg = '1./(1+x.^2)';  % 柯西分布
cauchy = inline(strg,'x');
% set up an inline function to evaluate the Normal pdf
strg = '1/sig*exp(-0.5*((x-mu)/sig).^2)'; % 正态分布
norm = inline(strg,'x','mu','sig');

We now generate $n=10000$ samples in the chain.
我们在链中生成 $n = 10000$ 个样本。

% Generate 10000 samples in the chain.
% Set up the constants.
n = 10000; %一万个数据
sig = 2;
x = zeros(1,n); % 分配内存
% generate the starting point
% 初始值,理论上初始值对MH链没有影响
x(1) = randn(1);

for i = 2:n
	% generate a candidate from the proposal distribution
	% which is the normal in this case. This will be a 
	% normal with mean given by the previous value in the
	% chain and standard deviation of 'sig'
	y = x(i-1) + sig*randn(1); % 用正态分布换算
	% generate a uniform for comparison
	u = rand(1);
	alpha = min([1, cauchy(y)*norm(x(i-1),y,sig)/...
       (cauchy(x(i-1))*norm(y,x(i-1),sig))]);
	if u <= alpha
		x(i) = y;
	else
		x(i) = x(i-1);
	end
end  

We can plot a density histogram along with the curve corresponding to the true probability density function.
我们可以绘制密度直方图以及对应于真实概率密度函数的曲线。

We discard the first 500 points for the burn-in period. The plot is shown in Figure 14.1.
我们舍弃燃烧期的前500个点,结果如图14.1所示。

[N,h]=hist(x);
bar(h,N/(n * (h(2)- h(1))),1,'w')
hold on
% xi = -30:2.5:30;
% yi = ksdensity(x,xi);
% bar(xi,yi,'w'); 
f = @(x) 1./(pi*(1+x.^2));
fplot(f,[-30,30],'k-')
hold off
(图 14.1:MH 采样得到的样本直方图与标准柯西密度曲线)
刚开始可能并不是这么完美,可能有偏差,多运行几次

Metropolis Sampler

The Gibbs Sampler

Although the Gibbs sampler can be shown to be a special case of the Metropolis-Hastings algorithm [Gilks, et al., 1996b; Robert and Casella,1999], we include it in its own section because it is different in some fundamental ways.
尽管吉布斯采样器可以证明是Metropolis-Hastings算法的一种特例 [Gilks等,1996b; Robert and Casella,1999],我们将其包含在其单独的部分中,因为它在某些基本方式上有所不同。

The two main differences between the Gibbs sampler and Metropolis-Hastings are
Gibbs采样器和Metropolis-Hastings之间的两个主要区别是

  1. We always accept a candidate point.
    我们总是接受一个候选点。
  2. We must know the full conditional distributions.
    我们必须知道完整的条件分布。

In general, the fact that we must know the full conditional distributions makes the algorithm less applicable.
通常,必须知道全部条件分布的事实使该算法不太适用。

问题的引出:由联合分布求边际分布
$$f(x) = \int \cdots \int f(x,y_1,y_2,...,y_d)\, dy_1\, dy_2 \cdots dy_d \tag{14.10}$$
如果直接求解这个积分,由于要算 $d$ 维积分,算法太复杂,所以引入 Gibbs 抽样来简化积分的运算。

用Gibbs抽样器从 $f(x)$ 中产生 $X_1,...,X_m$ 个样本,用这些样本去估计想要的关于 $f(x)$ 的特征。

运用Gibbs抽样抽取二维随机变量,联合分布为 $f(x)$,随机变量为 $X_t = (X_{t,1},X_{t,2})$。

选取一个初始值,令 $t=0$,则随机变量为 $X_0 = (X_{0,1},X_{0,2})$。

从 $f(x_1,x_2)$ 中产生样本时,先从条件分布 $f(x_1|x_2)$ 和 $f(x_2|x_1)$ 中产生样本,目的是固定住一个随机变量,方便抽取其余的随机变量。

At each iteration, the elements of the random vector are obtained one at a time by alternately generating values from the conditional distributions.
在每次迭代中,通过从条件分布交替生成值来一次获得随机向量的元素。

PROCEDURE - GIBBS SAMPLER (BIVARIATE CASE)

  1. Generate a starting point $X_0 = (X_{0,1},X_{0,2})$. Set $t=0$.
    生成起点 $X_0 = (X_{0,1},X_{0,2})$。设置 $t = 0$。
  2. Generate a point $X_{t+1,1}$ from $f(X_{t+1,1}\mid X_{t,2}=x_{t,2})$
    从 $f(X_{t+1,1}\mid X_{t,2}=x_{t,2})$ 生成点 $X_{t+1,1}$
  3. Generate a point $X_{t+1,2}$ from $f(X_{t+1,2}\mid X_{t+1,1}=x_{t+1,1})$
    从 $f(X_{t+1,2}\mid X_{t+1,1}=x_{t+1,1})$ 生成点 $X_{t+1,2}$
  4. Set $t=t+1$ and repeat steps 2 through 4.
    设置 $t = t + 1$ 并重复步骤2至4。

Note that the conditional distributions are conditioned on the current or most recent values of the other components of $X_t$.
请注意,条件分布是以 $X_t$ 其他分量的当前值或最新值为条件的。

Example 14.6 shows how this is done in a simple case taken from Casella and George [1992].

Example 14.6

联合分布 $f(x, y) \propto\binom{n}{x} y^{x+\alpha-1}(1-y)^{n-x+\beta-1}$,其中 $x=0,1,...,n$,$0\le y \le 1$(关于 $y$ 的部分具有 Beta 分布的核)

条件分布 $f(x|y)\sim B(n,y)$

条件分布 $f(y|x)\sim Beta(x+\alpha,\, n-x+\beta)$

% Set up preliminaries.
% Here we use k for the chain length, because n 
% is used for the number of trials in a binomial.
k = 1000; % generate a chain of size 1000
m = 500;  % burn-in will be 500
a = 2;	 % chosen
b = 4;
x = zeros(1,k);
y = zeros(1,k);
n = 16;

接下来的是第二步和第三步,不断重复

% Pick a starting point.
% 给第一个分量的值为二项分布
x(1) = binornd(n,0.5,1,1);
% 给第二个分量的值为beta分布
y(1) = betarnd(x(1) + a, n - x(1) + b,1,1);
% 重复 k = 1000 次
for i = 2:k
		x(i) = binornd(n,y(i-1),1,1);
		% x 抽取之后,beta分布就已知了
		% 所以抽取y的时候就可以
		% 直接从beta分布里面抽取
		y(i) = betarnd(x(i)+a, n-x(i)+b, 1, 1);
end

不用担心是否要接受或者拒绝,因为Gibbs总是接受,边际密度函数的估计用以下式子扔掉前m次
$$\hat{f}(x) = \frac{1}{k-m} \sum_{i=m+1}^{k} f(x|y_i)$$

% Get the marginal by evaluating the conditional.
% Use MATLAB's Statistics Toolbox.
% Find the P(X=x|Y's)
fhat = zeros(1,17);
for i = 1:17
		fhat(i) = mean(binopdf(i-1,n,y(500:k)));
end
  • 为啥这里的是17次,为啥是i-1?

17是因为MATLAB计数从1开始

因为 x 的边际分布取值为 0,1,...,n(这里 n=16),共有 17 个可能的取值。

这里 fhat(i) 估计的是边际概率 P(X = i-1),即第 i 个取值对应的边际概率。
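
这个例子里 X 的精确边际分布其实是 Beta-Binomial 分布(这是标准结果,以下代码为补充示意,非书中代码),可以用它来检验 Gibbs 估计 fhat 的好坏:

% 示意:与精确的 Beta-Binomial 边际分布比较(沿用上面代码中的 n、a、b、fhat)
xvals = 0:n;
ftrue = exp(gammaln(n+1) - gammaln(xvals+1) - gammaln(n-xvals+1) ...
      + betaln(xvals+a, n-xvals+b) - betaln(a,b));
bar(xvals, fhat, 1, 'w')     % Gibbs 估计的边际概率
hold on
plot(xvals, ftrue, 'k*')     % 精确的 Beta-Binomial 边际概率
hold off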

PROCEDURE - GIBBS SAMPLER

  1. Generate a starting point $X_0 = (X_{0,1},X_{0,2},...,X_{0,d})$. Set $t=0$.
    生成起点 $X_0 = (X_{0,1},X_{0,2},...,X_{0,d})$。设置 $t = 0$。
  2. Generate a point $X_{(t+1),1}$ from $f(X_{(t+1),1}\mid X_{t,2}=x_{t,2},...,X_{t,d}=x_{t,d})$.
    Generate a point $X_{(t+1),2}$ from $f(X_{(t+1),2}\mid X_{(t+1),1}=x_{(t+1),1},X_{t,3}=x_{t,3},...,X_{t,d}=x_{t,d})$.
    ......
    Generate a point $X_{(t+1),d}$ from $f(X_{(t+1),d}\mid X_{(t+1),1}=x_{(t+1),1},...,X_{(t+1),d-1}=x_{(t+1),d-1})$.
  3. Set $t=t+1$ and repeat steps 2 through 3.


从下列公式可以看出,Gibbs抽样是用的坐标更新的方法,依次更新坐标。

$$x_{1,1} \sim f(X_{0,1} \mid X_{0,2} = x_{0,2},...,X_{0,d} = x_{0,d})$$

$$x_{2,1} \sim f(X_{0,2} \mid X_{1,1} = x_{1,1},X_{0,3} = x_{0,3},...,X_{0,d} = x_{0,d})$$

$$x_{d,1} \sim f(X_{0,d} \mid X_{1,1} = x_{1,1},X_{2,1} = x_{2,1},...,X_{d-1,1} = x_{d-1,1})$$

例 14.7 二维正态
$$\mu=\left[\begin{array}{l} \mu_{1} \\ \mu_{2} \end{array}\right]=\left[\begin{array}{l} 1 \\ 2 \end{array}\right] \quad \Sigma=\left[\begin{array}{ll} 1 & \rho \\ \rho & 1 \end{array}\right]=\left[\begin{array}{cc} 1 & 0.9 \\ 0.9 & 1 \end{array}\right]$$
条件分布 $f(x_1|x_2)\sim N(\mu_1+\rho (x_2 - \mu_2),\ 1-\rho^2)$

条件分布 $f(x_2|x_1)\sim N(\mu_2+\rho (x_1 - \mu_1),\ 1-\rho^2)$

% Set up constants and arrays.
n = 6000;
xgibbs = zeros(n,2);							
rho = 0.9;
y = [1;2]; % This is the mean.
sig = sqrt(1-rho^2);								
% Initial point.
xgibbs(1,:) = [10 10];
% Start the chain.
for i = 2:n
   mu = y(1) + rho*(xgibbs(i-1,2)-y(2));
   % 条件分布为一元正态
   xgibbs(i,1) = mu + sig*randn(1); 
   mu = y(2) + rho*(xgibbs(i,1) - y(1));
   % 条件分布为一元正态,且需要用到本轮刚更新的值
   xgibbs(i,2) = mu + sig*randn(1);
end
scatter(xgibbs(:,1),xgibbs(:,2))
(图:Gibbs 抽样得到的二维正态样本散点图)