Machine Learning Study Notes: PRML Chapter 2.0: Prerequisite - Sufficient Statistics

Chapter 2.0: Prerequisite 1 - Sufficient Statistics

PRML, Oxford University Deep Learning Course, Machine Learning, Pattern Recognition
Christopher M. Bishop, PRML, Chapter 2 Probability Distributions

1. Introduction

In the process of estimating parameters, we summarize, or reduce, the information in a sample of size $n$, $\{X_1, X_2, \ldots, X_n\}$, to a single number, such as the sample mean $\bar{X}$. The actual sample values are no longer important to us. That is, if we use a sample mean of 3 to estimate the population mean $\mu$, it doesn't matter if the original data values were $(1,3,5)$ or $(2,3,4)$.

Problems:
- Has this process of reducing the n data points to a single number retained all of the information about μ that was contained in the original n data points?
- Or has some information about the parameter been lost through the process of summarizing the data?

In this lesson, we’ll learn how to find statistics that summarize all of the information in a sample about the desired parameter. Such statistics are called sufficient statistics.

2. Definition of Sufficiency

2.1 Definition:

Let $X_1, X_2, \ldots, X_n$ be a random sample from a probability distribution with unknown parameter $\theta$. Then, the statistic $Y = u(X_1, X_2, \ldots, X_n)$ is said to be sufficient for $\theta$ if the conditional distribution of $X_1, X_2, \ldots, X_n$, given the statistic $Y$, i.e., $P(X_1=x_1, X_2=x_2, \ldots, X_n=x_n \mid Y=y)$, does not depend on the parameter $\theta$.
- Why called “sufficient”?
- We say that $Y$ is sufficient for $\theta$ because once the value of $Y$ is known (i.e., given $Y=y$, we already have all of the available information about the unknown parameter $\theta$), no other function of $X_1, X_2, \ldots, X_n$ will provide any additional information about the possible value of $\theta$.
- Sufficiency means that if we know the value of Y , we cannot gain any further information about the parameter θ by considering other functions of the data X1,X2,...,Xn .

2.2 Example 1 - Binomial Distribution:

Consider Bernoulli trials:

Let $X_1, X_2, \ldots, X_n$ be a random sample of $n$ Bernoulli trials in which success has probability $p$ and failure has probability $1-p$, i.e., $P(X_i=1)=p$ and $P(X_i=0)=q=1-p$, for $i=1,2,\ldots,n$. Suppose, in a random sample of $n=40$, that a total of $Y = \sum_{i=1}^{n} X_i = 22$ successes occur. If we know the value of $Y$, the number of successes in $n$ trials, can we gain any further information about the parameter $p$ by considering other functions of the data $X_1, X_2, \ldots, X_n$? Or, equivalently, is $Y$ sufficient for $p$?

Solution:

The definition of sufficiency tells us that if the conditional distribution of X1,X2,...,Xn , given the statistic Y , does not depend on p, then Y is said to be a sufficient statistic for the unknown parameter p. The conditional distribution of X1,X2,...,Xn , given Y , is given by:

$$P(X_1=x_1, X_2=x_2, \ldots, X_n=x_n \mid Y=y) = \frac{P(X_1=x_1, X_2=x_2, \ldots, X_n=x_n, Y=y)}{P(Y=y)} \tag{2.1}$$

Now, for the sake of concreteness, suppose we were to observe a random sample of size $n=3$ in which $x_1=1$, $x_2=0$, $x_3=1$. In this case:

$$P(X_1=1, X_2=0, X_3=1, Y=1) = 0$$

because $Y = 1 \neq \sum_{i=1}^{3} x_i = 1 + 0 + 1 = 2$, the event in the numerator of (2.1) is impossible, and therefore its probability is 0.

Now, let's consider an event that is possible, namely $(X_1=1, X_2=0, X_3=1, Y=2)$. In that case, we have, by independence:

$$P(X_1=1, X_2=0, X_3=1, Y=2) = p(1-p)p = p^2(1-p)$$

So, in general:

$$P(X_1=x_1, X_2=x_2, \ldots, X_n=x_n, Y=y) = 0, \quad \text{if } \sum_{i=1}^{n} x_i \neq y$$

and

$$P(X_1=x_1, X_2=x_2, \ldots, X_n=x_n, Y=y) = p^{y}(1-p)^{n-y}, \quad \text{if } \sum_{i=1}^{n} x_i = y$$

Now, the denominator in (2.1) is the binomial probability of getting exactly y successes in n trials with a probability of success p . That is, the denominator is:

$$P(Y=y) = \binom{n}{y} p^{y}(1-p)^{n-y}$$
for $y = 0, 1, 2, \ldots, n$.

Putting the numerator and denominator together, we get

$$P(X_1=x_1, X_2=x_2, \ldots, X_n=x_n \mid Y=y) = \frac{p^{y}(1-p)^{n-y}}{\binom{n}{y} p^{y}(1-p)^{n-y}} = \frac{1}{\binom{n}{y}}, \quad \text{if } \sum_{i=1}^{n} x_i = y$$

and

$$P(X_1=x_1, X_2=x_2, \ldots, X_n=x_n \mid Y=y) = 0, \quad \text{if } \sum_{i=1}^{n} x_i \neq y$$

Conclusion 1:

We have just shown that the conditional distribution of X1,X2,...,Xn given Y does not depend on p. Therefore, Y is indeed sufficient for p. That is, once the value of Y is known, no other function of X1,X2,...,Xn will provide any additional information about the possible value of p .
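
To make Conclusion 1 concrete, here is a small simulation sketch (my own illustration, not from the original lesson): it draws Bernoulli samples under several values of $p$, conditions on $Y = y$, and checks that every arrangement with exactly $y$ successes occurs with frequency roughly $1/\binom{n}{y}$, regardless of $p$.

```python
# Empirical check of Conclusion 1: given Y = y, every arrangement of the
# sample is equally likely (probability 1 / C(n, y)), whatever p is.
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
n, y = 3, 2

for p in (0.2, 0.5, 0.8):
    samples = (rng.random((200_000, n)) < p).astype(int)  # Bernoulli(p) rows
    conditioned = samples[samples.sum(axis=1) == y]       # keep rows with Y = y
    for pattern in itertools.product((0, 1), repeat=n):
        if sum(pattern) != y:
            continue
        freq = (conditioned == pattern).all(axis=1).mean()
        print(f"p={p}: P({pattern} | Y={y}) ~ {freq:.3f}")
    print(f"   theory: 1/C({n},{y}) = {1 / math.comb(n, y):.3f}")
```

Each of the three arrangements with two successes comes out near $1/3$ for every $p$, which is exactly the $1/\binom{n}{y}$ conditional probability derived above.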

3. Factorization Theorem

3.1 We need an easier method to identify sufficiency:

While the definition of sufficiency may make sense intuitively, it is not always easy to find the conditional distribution of $X_1, X_2, \ldots, X_n$ given $Y$. Not to mention that we'd have to find that conditional distribution for every statistic $Y$ we'd want to consider as a possible sufficient statistic! Therefore, using the formal definition of sufficiency as a way of identifying a sufficient statistic for a parameter $\theta$ can often be a daunting road to follow. Thankfully, a theorem often referred to as the Factorization Theorem provides an easier alternative!

3.2 Factorization Theorem:

Let $X_1, X_2, \ldots, X_n$ denote random variables with joint probability density function or joint probability mass function $f(x_1, x_2, \ldots, x_n; \theta)$, which depends on the parameter $\theta$. Then, the statistic $Y = u(X_1, X_2, \ldots, X_n)$ is sufficient for $\theta$ if and only if the joint p.d.f. (or p.m.f.) can be factored into two components, that is:

$$f(x_1, x_2, \ldots, x_n; \theta) = \phi[\,u(x_1, x_2, \ldots, x_n); \theta\,] \cdot h(x_1, x_2, \ldots, x_n)$$

where:
- $\phi$ is a function that depends on the data $x_1, x_2, \ldots, x_n$ only through the function $y = u(x_1, x_2, \ldots, x_n)$, and
- the function $h(x_1, x_2, \ldots, x_n)$ does not depend on the parameter $\theta$.

3.3 Example 2 - Poisson Distribution:

The Poisson p.m.f. with parameter $\lambda > 0$ is:

$$f(x; \lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots$$

Recall that the mathematical constant $e$ is the unique real number such that the value of the derivative (slope of the tangent line) of the function $f(x) = e^x$ at the point $x = 0$ is equal to 1. It turns out that the constant is irrational, but to five decimal places, it equals $e \approx 2.71828$. Also, note that there are (theoretically) an infinite number of possible Poisson distributions. Any specific Poisson distribution depends on the parameter $\lambda$.

Let X1,X2,...,Xn denote a random sample from a Poisson distribution with parameter λ>0 . Find a sufficient statistic for the parameter λ .

Solution:

Because X1,X2,...,Xn is a random sample, the joint probability mass function of X1,X2,...,Xn is, by independence:

$$f(x_1, x_2, \ldots, x_n; \lambda) = f(x_1; \lambda) f(x_2; \lambda) \cdots f(x_n; \lambda) = \frac{e^{-\lambda}\lambda^{x_1}}{x_1!} \cdot \frac{e^{-\lambda}\lambda^{x_2}}{x_2!} \cdots \frac{e^{-\lambda}\lambda^{x_n}}{x_n!} = \left(e^{-n\lambda}\lambda^{\sum x_i}\right)\left(\frac{1}{x_1! \, x_2! \cdots x_n!}\right)$$

Hey, look at that! We just factored the joint p.m.f. into two functions, one ($\phi$) being only a function of the statistic $Y = \sum_{i=1}^{n} X_i$ and the other ($h$) not depending on the parameter $\lambda$.

We can also write the joint p.m.f. as:

$$f(x_1, x_2, \ldots, x_n; \lambda) = \left(e^{-n\lambda}\lambda^{n\bar{x}}\right)\left(\frac{1}{x_1! \, x_2! \cdots x_n!}\right)$$

Therefore, the Factorization Theorem tells us that $Y = \bar{X}$ is also a sufficient statistic for $\lambda$.

If you think about it, it makes sense that $Y = \bar{X}$ and $Y = \sum_{i=1}^{n} X_i$ are both sufficient statistics, because if we know $\bar{X}$, we can easily find $\sum_{i=1}^{n} X_i$, and vice versa.

Conclusion 2:

There can be more than one sufficient statistic for a parameter θ . In general, if Y is a sufficient statistic for a parameter θ, then every one-to-one function of Y not involving θ is also a sufficient statistic for θ .
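
As a quick numerical illustration of this (a sketch of my own, not from the original lesson): for Poisson data, two samples with the same total $\sum x_i$, such as $(1,3,5)$ and $(2,3,4)$ from the introduction, have log-likelihoods that differ only by a constant free of $\lambda$, namely the $\log h(x)$ term from the factorization.

```python
# For Poisson data the λ-dependence of the likelihood enters only through
# Y = Σ x_i, so samples with the same total differ only by the constant h(x).
import numpy as np
from scipy.stats import poisson

x_a = np.array([1, 3, 5])   # Σ x_i = 9
x_b = np.array([2, 3, 4])   # same total, Σ x_i = 9

for lam in (0.5, 2.0, 7.0):
    diff = poisson.logpmf(x_a, lam).sum() - poisson.logpmf(x_b, lam).sum()
    print(f"lambda={lam}: log-likelihood difference = {diff:.6f}")
# The printed difference is identical for every λ: it equals
# log h(x_a) - log h(x_b) = log(2!·3!·4!) - log(1!·3!·5!).
```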

3.4 Example 3 - Gaussian Distribution N(μ,1) :

Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution with mean $\mu$ and variance $\sigma^2 = 1$. Find a sufficient statistic for the parameter $\mu$.

Solution:

For i.i.d. data X1,X2,...,Xn , the joint probability density function of X1,X2,...,Xn is

$$f(x_1, x_2, \ldots, x_n; \mu) = f(x_1; \mu) \times f(x_2; \mu) \times \cdots \times f(x_n; \mu) = \frac{1}{(2\pi)^{1/2}}\exp\left[-\frac{1}{2}(x_1-\mu)^2\right] \times \cdots \times \frac{1}{(2\pi)^{1/2}}\exp\left[-\frac{1}{2}(x_n-\mu)^2\right] = \frac{1}{(2\pi)^{n/2}}\exp\left[-\frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^2\right]$$

A trick to making the factoring of the joint p.d.f. an easier task is to add 0 to the quantity in parentheses in the summation. That is:
$$\sum_{i=1}^{n}(x_i-\mu)^2 = \sum_{i=1}^{n}\left[(x_i - \bar{x}) + (\bar{x} - \mu)\right]^2$$
Now, squaring the quantity in parentheses, we get:

$$f(x_1, x_2, \ldots, x_n; \mu) = \frac{1}{(2\pi)^{n/2}}\exp\left[-\frac{1}{2}\sum_{i=1}^{n}\left[(x_i-\bar{x})^2 + 2(x_i-\bar{x})(\bar{x}-\mu) + (\bar{x}-\mu)^2\right]\right]$$

And then distributing the summation, we get:

$$f(x_1, x_2, \ldots, x_n; \mu) = \frac{1}{(2\pi)^{n/2}}\exp\left[-\frac{1}{2}\sum_{i=1}^{n}(x_i-\bar{x})^2 - (\bar{x}-\mu)\sum_{i=1}^{n}(x_i-\bar{x}) - \frac{1}{2}\sum_{i=1}^{n}(\bar{x}-\mu)^2\right]$$

But, the middle term in the exponent is 0 , and the last term, because it doesn’t depend on the index i, can be added up n times:
$$\sum_{i=1}^{n}(x_i-\bar{x}) = 0 \quad \text{and} \quad \sum_{i=1}^{n}(\bar{x}-\mu)^2 = n(\bar{x}-\mu)^2$$

So, simplifying, we get:

$$f(x_1, x_2, \ldots, x_n; \mu) = \left\{\exp\left[-\frac{n}{2}(\bar{x}-\mu)^2\right]\right\} \times \left\{\frac{1}{(2\pi)^{n/2}}\exp\left[-\frac{1}{2}\sum_{i=1}^{n}(x_i-\bar{x})^2\right]\right\}$$

In summary, we have factored the joint p.d.f. into two functions, one ($\phi$) being only a function of the statistic $Y = \bar{X}$ and the other ($h$) not depending on the parameter $\mu$.

Conclusion 3:
  • Therefore, the Factorization Theorem tells us that $Y = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is a sufficient statistic for $\mu$ (a numerical check of the factorization follows after this list).
  • Now, $Y = \bar{X}^3$ is also sufficient for $\mu$, because if we are given the value of $\bar{X}^3$, we can easily get the value of $\bar{X}$ through the one-to-one function $w = y^{1/3}$, that is, $W = (\bar{X}^3)^{1/3} = \bar{X}$.
  • However, $Y = \bar{X}^2$ is not a sufficient statistic for $\mu$, because $y = \bar{x}^2$ is not a one-to-one function: both $+\bar{X}$ and $-\bar{X}$ map to $\bar{X}^2$.
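
The factorization above can also be checked numerically. Below is a minimal sketch (my own, under the same $N(\mu, 1)$ setup): for a fixed sample, the product of the individual densities matches $\phi(\bar{x}; \mu) \cdot h(x)$ for any value of $\mu$.

```python
# Numerical check that the N(μ, 1) joint density equals φ(x̄; μ) · h(x),
# where φ depends on the data only through x̄.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=8)
n, xbar = x.size, x.mean()

h = (2 * np.pi) ** (-n / 2) * np.exp(-0.5 * ((x - xbar) ** 2).sum())
for mu in (-1.0, 0.0, 2.0):
    joint = norm.pdf(x, loc=mu, scale=1.0).prod()   # ∏ f(x_i; μ)
    phi = np.exp(-0.5 * n * (xbar - mu) ** 2)       # a function of x̄ and μ only
    print(f"mu={mu}: joint={joint:.6e}, phi*h={phi * h:.6e}")  # the two agree
```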

3.5 Example 4 - Exponential Distribution:

Let X1,X2,...,Xn be a random sample from an exponential distribution with parameter θ . Find a sufficient statistic for the parameter θ .

Solution:

The joint probability density function of X1,X2,...,Xn is, by independence:

$$f(x_1, x_2, \ldots, x_n; \theta) = f(x_1; \theta) \times f(x_2; \theta) \times \cdots \times f(x_n; \theta)$$

Since each $X_i$ has p.d.f. $f(x; \theta) = \frac{1}{\theta}e^{-x/\theta}$, the joint p.d.f. is:

$$f(x_1, x_2, \ldots, x_n; \theta) = \frac{1}{\theta}\exp\left(-\frac{x_1}{\theta}\right) \times \frac{1}{\theta}\exp\left(-\frac{x_2}{\theta}\right) \times \cdots \times \frac{1}{\theta}\exp\left(-\frac{x_n}{\theta}\right)$$

Now, simplifying by collecting the $n$ factors of $1/\theta$ and summing the $x_i$ in the exponents, we get:

$$f(x_1, x_2, \ldots, x_n; \theta) = \frac{1}{\theta^{n}}\exp\left(-\frac{1}{\theta}\sum_{i=1}^{n}x_i\right)$$

We have again factored the joint p.d.f. into two functions, one ($\phi$) being only a function of the statistic $Y = \sum_{i=1}^{n} X_i$ and the other ($h = 1$) not depending on the parameter $\theta$.

Conclusion 4:

Therefore, the Factorization Theorem tells us that $Y = \sum_{i=1}^{n} X_i$ is a sufficient statistic for $\theta$. And, since $\bar{X}$ is a one-to-one function of $\sum_{i=1}^{n} X_i$, it follows that $Y = \bar{X}$ is also a sufficient statistic for $\theta$.
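
The same kind of numerical check works here (again a sketch of my own, not from the lesson): the exponential joint density equals $\phi\!\left(\sum x_i; \theta\right) = \theta^{-n}\exp\!\left(-\sum x_i/\theta\right)$ with $h(x) = 1$.

```python
# Check that the exponential joint density depends on the data only
# through Σ x_i (with h(x) = 1), as the factorization above shows.
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(2)
theta = 3.0
x = rng.exponential(scale=theta, size=6)

joint = expon.pdf(x, scale=theta).prod()             # ∏ (1/θ) exp(-x_i/θ)
phi = theta ** (-x.size) * np.exp(-x.sum() / theta)  # function of Σ x_i only
print(np.isclose(joint, phi))                        # True
```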

4. Exponential Form

4.1 Exponential Form

You might not have noticed that in all of the examples we have considered so far in this lesson, every p.d.f. or p.m.f. could be written in what is often called exponential form, that is:

$$f(x; \theta) = \exp\left[K(x)\,p(\theta) + S(x) + q(\theta)\right]$$

1) Exponential Form of Bernoulli Distribution:

For example, the Bernoulli p.m.f. can be written in exponential form as:

$$f(x; p) = p^{x}(1-p)^{1-x} = \exp\left[x\ln\frac{p}{1-p} + \ln(1-p)\right], \quad x = 0, 1$$
with

  • (1) $K(x) = x$ and $S(x) = \ln(1) = 0$ being functions only of $x$,
  • (2) $p(p) = \ln\frac{p}{1-p}$ and $q(p) = \ln(1-p)$ being functions only of the parameter $p$, and
  • (3) the support $x = 0, 1$ not depending on the parameter $p$.

2) Exponential Form of Poisson Distribution:

$$f(x; \lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!} = \exp\left[x\ln\lambda - \ln(x!) - \lambda\right], \quad x = 0, 1, 2, \ldots$$
with

  • (1) $K(x) = x$ and $S(x) = -\ln(x!)$ being functions only of $x$,
  • (2) $p(\lambda) = \ln\lambda$ and $q(\lambda) = -\lambda$ being functions only of the parameter $\lambda$, and
  • (3) the support $x = 0, 1, 2, \ldots$ not depending on the parameter $\lambda$.
3) Exponential Form of Gaussian Distribution N(μ,1) :

$$f(x; \mu) = \frac{1}{\sqrt{2\pi}}\exp\left[-\frac{(x-\mu)^2}{2}\right] = \exp\left[x\mu - \frac{x^2}{2} - \frac{\mu^2}{2} - \frac{1}{2}\ln(2\pi)\right]$$
with

  • (1) $K(x) = x$ and $S(x) = -\frac{x^2}{2}$ being functions only of $x$,
  • (2) $p(\mu) = \mu$ and $q(\mu) = -\frac{\mu^2}{2} - \frac{1}{2}\ln(2\pi)$ being functions only of the parameter $\mu$, and
  • (3) the support $-\infty < x < \infty$ not depending on the parameter $\mu$.
4) Exponential Form of Exponential Distribution:

$$f(x; \theta) = \frac{1}{\theta}e^{-x/\theta} = \exp\left[-\frac{x}{\theta} - \ln\theta\right], \quad x \ge 0$$
with

  • (1) $K(x) = x$ and $S(x) = \ln(1) = 0$ being functions only of $x$,
  • (2) $p(\theta) = -\frac{1}{\theta}$ and $q(\theta) = -\ln\theta$ being functions only of the parameter $\theta$, and
  • (3) the support $x \ge 0$ not depending on the parameter $\theta$.

4.2 Exponential Criterion

It turns out that writing p.d.f.s and p.m.f.s in exponential form provides us yet a third way of identifying sufficient statistics for our parameters. The following theorem tells us how.

Theorem:

Let X1,X2,...,Xn be a random sample from a distribution with a p.d.f. or p.m.f. of the exponential form:

$$f(x; \theta) = \exp\left[K(x)\,p(\theta) + S(x) + q(\theta)\right]$$

with a support that does not depend on θ, that is,
- (1) $K(x)$ and $S(x)$ being functions only of $x$,
- (2) p(θ) and q(θ) being functions only of the parameter θ , and
- (3) the support being free of the parameter θ .

Then, the statistic:

$$Y = \sum_{i=1}^{n} K(X_i)$$

is sufficient for $\theta$.

Proof:

$$f(x_1, x_2, \ldots, x_n; \theta) = f(x_1; \theta) \times f(x_2; \theta) \times \cdots \times f(x_n; \theta)$$

$$f(x_1, \ldots, x_n; \theta) = \exp\left[K(x_1)p(\theta) + S(x_1) + q(\theta)\right] \times \cdots \times \exp\left[K(x_n)p(\theta) + S(x_n) + q(\theta)\right]$$

Collecting like terms in the exponents, we get:

$$f(x_1, \ldots, x_n; \theta) = \exp\left[p(\theta)\sum_{i=1}^{n}K(x_i) + \sum_{i=1}^{n}S(x_i) + n\,q(\theta)\right]$$

which can be factored as:

$$f(x_1, \ldots, x_n; \theta) = \left\{\exp\left[p(\theta)\sum_{i=1}^{n}K(x_i) + n\,q(\theta)\right]\right\} \times \left\{\exp\left[\sum_{i=1}^{n}S(x_i)\right]\right\}$$

We have factored the joint p.m.f. or p.d.f. into two functions:
- one ($\phi$) being only a function of the statistic $Y = \sum_{i=1}^{n} K(X_i)$, and
- the other ($h$) not depending on the parameter $\theta$.

Therefore, the Factorization Theorem tells us that $Y = \sum_{i=1}^{n} K(X_i)$ is a sufficient statistic for $\theta$.
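
In code, the Exponential Criterion reduces finding a sufficient statistic to summing $K(x_i)$ over the sample. A minimal sketch (the helper name below is my own, not from the lesson):

```python
# Generic sufficient statistic for a density in exponential form:
# Y = Σ K(x_i). All four examples in this lesson have K(x) = x.
import numpy as np

def sufficient_statistic(x, K=lambda t: t):
    """Return Y = sum of K(x_i) over the sample x."""
    x = np.asarray(x, dtype=float)
    return K(x).sum()

data = [1.0, 3.0, 5.0]
print(sufficient_statistic(data))             # K(x) = x   ->  Y = 9.0
print(sufficient_statistic(data, np.square))  # K(x) = x^2 ->  Y = 35.0
```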

4.3 Example 5 - Geometric Distribution:

Let X1,X2,...,Xn be a random sample from a geometric distribution with parameter p . Find a sufficient statistic for the parameter p.

Solution:

The probability mass function of a geometric random variable is:

$$f(x; p) = (1-p)^{x-1}\,p$$

for $x = 1, 2, 3, \ldots$. The p.m.f. can be written in exponential form as:

$$f(x; p) = \exp\left[x\ln(1-p) + \ln(1) + \ln\frac{p}{1-p}\right]$$

Conclusion 5:

Since $K(x) = x$, the Exponential Criterion tells us that $Y = \sum_{i=1}^{n} X_i$ is sufficient for $p$. Easy as pie!
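
A quick numerical sanity check of the exponential-form rewrite (my own sketch, using scipy's geom distribution):

```python
# Verify that (1-p)^(x-1)·p equals exp[x·ln(1-p) + ln(p/(1-p))] for x = 1, 2, ...
import numpy as np
from scipy.stats import geom

p = 0.3
x = np.arange(1, 8)
direct = geom.pmf(x, p)                                    # (1-p)^(x-1) · p
exp_form = np.exp(x * np.log(1 - p) + np.log(p / (1 - p)))
print(np.allclose(direct, exp_form))                       # True
```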

5. Two or More Parameters

What happens if a probability distribution has two parameters, θ1 and θ2 , say, for which we want to find sufficient statistics, Y1 and Y2 ? Fortunately, the definitions of sufficiency can easily be extended to accommodate two (or more) parameters. Let’s start by extending the Factorization Theorem.

5.1 Factorization Theorem

Let $X_1, X_2, \ldots, X_n$ denote random variables with joint p.d.f. or p.m.f. $f(x_1, x_2, \ldots, x_n; \theta_1, \theta_2)$, which depends on the parameters $\theta_1$ and $\theta_2$. Then the statistics $Y_1 = u_1(X_1, \ldots, X_n)$ and $Y_2 = u_2(X_1, \ldots, X_n)$ are jointly sufficient for $\theta_1$ and $\theta_2$ if and only if:

$$f(x_1, \ldots, x_n; \theta_1, \theta_2) = \phi[\,u_1(x_1, \ldots, x_n), u_2(x_1, \ldots, x_n); \theta_1, \theta_2\,] \cdot h(x_1, \ldots, x_n)$$

where the function $h(x_1, \ldots, x_n)$ does not depend on $\theta_1$ or $\theta_2$.

5.2 Example 6 - Gaussian Distribution N(μ,σ2) :

Let $X_1, X_2, \ldots, X_n$ denote a random sample from a normal distribution $N(\theta_1, \theta_2)$. That is, $\theta_1$ denotes the mean $\mu$ and $\theta_2$ denotes the variance $\sigma^2$. Use the Factorization Theorem to find joint sufficient statistics for $\theta_1$ and $\theta_2$.

Solution:

The joint probability density function of X1,X2,...,Xn is, by independence:

$$f(x_1, x_2, \ldots, x_n; \theta_1, \theta_2) = f(x_1; \theta_1, \theta_2) \times f(x_2; \theta_1, \theta_2) \times \cdots \times f(x_n; \theta_1, \theta_2)$$

Since the Gaussian p.d.f. is

$$f(x_i; \theta_1, \theta_2) = \frac{1}{(2\pi\theta_2)^{1/2}}\exp\left[-\frac{1}{2}\frac{(x_i-\theta_1)^2}{\theta_2}\right]$$

We get

$$f(x_1, x_2, \ldots, x_n; \theta_1, \theta_2) = \left(\frac{1}{\sqrt{2\pi\theta_2}}\right)^{n}\exp\left[-\frac{1}{2}\sum_{i=1}^{n}\frac{(x_i-\theta_1)^2}{\theta_2}\right]$$

Rewriting the first factor, squaring the quantity in parentheses, and distributing the summation in the second factor, we get:

$$f(x_1, x_2, \ldots, x_n; \theta_1, \theta_2) = \exp\left[\log\left(\frac{1}{\sqrt{2\pi\theta_2}}\right)^{n}\right]\exp\left[-\frac{1}{2\theta_2}\left\{\sum_{i=1}^{n}x_i^2 - 2\theta_1\sum_{i=1}^{n}x_i + \sum_{i=1}^{n}\theta_1^2\right\}\right]$$

Simplifying yet more, we get:

$$f(x_1, x_2, \ldots, x_n; \theta_1, \theta_2) = \exp\left[-\frac{1}{2\theta_2}\sum_{i=1}^{n}x_i^2 + \frac{\theta_1}{\theta_2}\sum_{i=1}^{n}x_i - \frac{n\theta_1^2}{2\theta_2} - \frac{n}{2}\log(2\pi\theta_2)\right]$$

Look at that! We have factored the joint p.d.f. into two functions, one ($\phi$) being only a function of the statistics $Y_1 = \sum_{i=1}^{n} X_i^2$ and $Y_2 = \sum_{i=1}^{n} X_i$, and the other ($h = 1$) not depending on the parameters $\theta_1$ and $\theta_2$.

Conclusion 6.1:
  • Therefore, the Factorization Theorem tells us that $Y_1 = \sum_{i=1}^{n} X_i^2$ and $Y_2 = \sum_{i=1}^{n} X_i$ are joint sufficient statistics for $\theta_1$ and $\theta_2$.
  • And, the one-to-one functions of $Y_1$ and $Y_2$, namely:
    $$\bar{X} = \frac{Y_2}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

    and
    $$S^2 = \frac{Y_1 - (Y_2^2/n)}{n-1} = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right]$$
    are also joint sufficient statistics for $\theta_1$ and $\theta_2$ (see the sketch after this list).
  • We have just shown that the intuitive estimators of μ and σ2 are also sufficient estimators. That is, the data contain no more information than the estimators X¯ and S2 do about the parameters μ and σ2 . That seems like a good thing!
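
As a final sketch (my own illustration, not from the lesson), the pair $(Y_1, Y_2) = \left(\sum X_i^2, \sum X_i\right)$ can be computed in one pass over the data and then mapped one-to-one to $\bar{X}$ and $S^2$:

```python
# Compute the joint sufficient statistics (Y1, Y2) for N(θ1, θ2) data and
# recover x̄ and S² through the one-to-one map given above.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=100)
n = x.size

y1, y2 = (x ** 2).sum(), x.sum()        # joint sufficient statistics
xbar = y2 / n                           # sample mean
s2 = (y1 - y2 ** 2 / n) / (n - 1)       # sample variance

print(np.allclose(xbar, x.mean()))      # True
print(np.allclose(s2, x.var(ddof=1)))   # True
```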

5.3 Exponential Criterion

We have just extended the Factorization Theorem. Now, the Exponential Criterion can also be extended to accommodate two (or more) parameters. It is stated here without proof.

Exponential Criterion:
Let X1,X2,...,Xn be a random sample from a distribution with a p.d.f. or p.m.f. of the exponential form:
$$f(x; \theta_1, \theta_2) = \exp\left[K_1(x)\,p_1(\theta_1, \theta_2) + K_2(x)\,p_2(\theta_1, \theta_2) + S(x) + q(\theta_1, \theta_2)\right]$$
with a support that does not depend on the parameters $\theta_1$ and $\theta_2$. Then, the statistics $Y_1 = \sum_{i=1}^{n} K_1(X_i)$ and $Y_2 = \sum_{i=1}^{n} K_2(X_i)$ are jointly sufficient for $\theta_1$ and $\theta_2$.

5.4 Example 6 - Gaussian Distribution N(μ,σ2) (continued):

Let $X_1, X_2, \ldots, X_n$ denote a random sample from a normal distribution $N(\theta_1, \theta_2)$. That is, $\theta_1$ denotes the mean $\mu$ and $\theta_2$ denotes the variance $\sigma^2$. Use the Exponential Criterion to find joint sufficient statistics for $\theta_1$ and $\theta_2$.

Solution:

The probability density function of a normal random variable with mean θ1 and variance θ2 can be written in exponential form as:
$$f(x; \theta_1, \theta_2) = \exp\left[-\frac{1}{2\theta_2}x^2 + \frac{\theta_1}{\theta_2}x - \frac{\theta_1^2}{2\theta_2} - \frac{1}{2}\log(2\pi\theta_2)\right]$$

with $K_1(x) = x^2$ and $K_2(x) = x$.

Conclusion 6.2:

Therefore, the statistics $Y_1 = \sum_{i=1}^{n} X_i^2$ and $Y_2 = \sum_{i=1}^{n} X_i$ are joint sufficient statistics for $\theta_1$ and $\theta_2$.

6. Reference

[1] Lesson 53: Sufficient Statistics, Penn State STAT 414 (https://onlinecourses.science.psu.edu/stat414/print/book/export/html/244)
