Machine Learning Study Notes, PRML Chapter 2.0: Prerequisite - Sufficient Statistics

GloryOfFamily

Article link: https://blog.csdn.net/ccj5351/article/details/51749420

Chapter 2.0 : Prerequisite 1 - Sufficient Statistics

PRML, Oxford University Deep Learning Course, Machine Learning, Pattern Recognition
Christopher M. Bishop, PRML, Chapter 2 Probability Distributions

1. Introduction

In the process of estimating parameters, we summarize, or reduce, the information in a sample of size $n$ , $\{X_1, X_2, ..., X_n\}$ , to a single number, such as the sample mean $\bar X$ . The actual sample values are no longer important to us. That is, if we use a sample mean of $3$ to estimate the population mean $\mu$ , it doesn’t matter if the original data values were $(1, 3, 5)$ or $(2, 3, 4)$ .

Problems:
- Has this process of reducing the $n$ data points to a single number retained all of the information about $\mu$ that was contained in the original $n$ data points?
- Or has some information about the parameter been lost through the process of summarizing the data?

In this lesson, we’ll learn how to find statistics that summarize all of the information in a sample about the desired parameter. Such statistics are called sufficient statistics.

2. Definition of Sufficiency

2.1 Definition:

Let $X_1, X_2, ..., X_n$ be a random sample from a probability distribution with unknown parameter $\theta$ . Then, the statistic $Y=u(X_1,X_2,...,X_n)$ is said to be sufficient for $\theta$ if the conditional distribution of $X_1,X_2,...,X_n$ , given the statistic $Y$ , i.e., $p(X_1 = x_1,X_2 = x_2,...,X_n = x_n \mid Y = y)$ does not depend on the parameter $\theta$ .
- Why is it called “sufficient”?
- We say that $Y$ is sufficient for $\theta$ because once the value of $Y$ is known (i.e., given $Y=y$ ), we already have all of the available information about the unknown parameter $\theta$ , and no other function of $X_1,X_2,...,X_n$ will provide any additional information about the possible value of $\theta$ .
- Sufficiency means that if we know the value of $Y$ , we cannot gain any further information about the parameter $\theta$ by considering other functions of the data $X_1,X_2,...,X_n$ .

2.2 Example 1 - Binomial Distribution:

Consider Bernoulli trials:

Let $X_1, X_2, ..., X_n$ be a random sample of $n$ Bernoulli trials in which a success has probability $p$ and a failure has probability $1-p$ , i.e., $P(X_i = 1) = p$ and $P(X_i = 0) = q = 1 - p$ , for $i = 1, 2, ..., n$ . Suppose that, in a random sample of size $n = 40$ , successes occur $Y=\sum_{i=1}^n X_i =22$ times in total. If we know the value of $Y$ , the number of successes in $n$ trials, can we gain any further information about the parameter $p$ by considering other functions of the data $X_1,X_2,...,X_n$ ? Or, equivalently, is $Y$ sufficient for $p$ ?

Solution:

The definition of sufficiency tells us that if the conditional distribution of $X_1,X_2,...,X_n$ , given the statistic $Y$ , does not depend on $p$ , then $Y$ is said to be a sufficient statistic for the unknown parameter $p$ . The conditional distribution of $X_1,X_2,...,X_n$ , given $Y$ , is given by:

$P(X_1 = x_1,X_2 = x_2,...,X_n = x_n \mid Y = y) = \frac{P(X_1 = x_1,X_2 = x_2,...,X_n = x_n , Y = y)}{P(Y=y)} \qquad (2.1)$

Now, for the sake of concreteness, suppose we were to observe a random sample of size $n = 3$ in which $x_1 = 1, x_2 = 0, x_3 = 1$ . In this case:

$P(X_1=1,X_2=0,X_3=1,Y=1) = 0$

because $(Y = 1) \neq (\sum_{i=1}^n X_i = 1 + 0 + 1 = 2)$ : the numerator of (2.1) corresponds to an impossible event, so its probability is $0$ .

Now, let’s consider an event that is possible, namely $(X_1=1, X_2 = 0, X_3 = 1, Y = 2)$ . In that case, we have, by independence:

$P(X_1=1, X_2 = 0, X_3 = 1, Y = 2) = p(1-p)p=p^2(1-p)$

So, in general:

$P(X_1 = x_1,X_2 = x_2,...,X_n = x_n , Y = y) = 0, \quad \text{if} \quad \sum_{i=1}^n x_i \neq y$

and

$P(X_1 = x_1,X_2 = x_2,...,X_n = x_n , Y = y) = p^y(1-p)^{n-y}, \quad \text{if} \quad \sum_{i=1}^n x_i = y$

Now, the denominator in (2.1) is the binomial probability of getting exactly $y$ successes in $n$ trials with a probability of success $p$ . That is, the denominator is:

$P(Y=y)=\dbinom{n}{y}p^y(1-p)^{n-y}$

for $y = 0, 1, 2, ..., n$ .

Putting the numerator and denominator together, we get

$P(X_1 = x_1,X_2 = x_2,...,X_n = x_n \mid Y = y) = \frac{p^y(1-p)^{n-y}}{\dbinom{n}{y}p^y(1-p)^{n-y}} = \frac{1}{\dbinom{n}{y}}, \quad \text{if} \quad \sum_{i=1}^n x_i = y$

and

$P(X_1 = x_1,X_2 = x_2,...,X_n = x_n \mid Y = y) = 0, \quad \text{if} \quad \sum_{i=1}^n x_i \neq y$

Conclusion 1:

We have just shown that the conditional distribution of $X_1, X_2, ..., X_n$ given $Y$ does not depend on $p$ . Therefore, $Y$ is indeed sufficient for $p$ . That is, once the value of $Y$ is known, no other function of $X_1, X_2, ..., X_n$ will provide any additional information about the possible value of $p$ .
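The calculation above can be checked by brute force. The following is a minimal sketch, not part of the original lesson: the function names `joint_prob` and `conditional_prob` are made up for illustration, and the denominator $P(Y=y)$ is obtained by summing over all 0/1 sequences rather than from the binomial formula, so the check is not circular.

```python
from itertools import product
from math import comb, isclose


def joint_prob(xs, p):
    """P(X_1 = x_1, ..., X_n = x_n) for independent Bernoulli(p) trials."""
    y = sum(xs)
    return p**y * (1 - p) ** (len(xs) - y)


def conditional_prob(xs, p):
    """P(X_1 = x_1, ..., X_n = x_n | Y = y) with y = sum(xs), as in (2.1)."""
    n, y = len(xs), sum(xs)
    # Denominator P(Y = y): add up every 0/1 sequence with the same total.
    marginal = sum(
        joint_prob(zs, p) for zs in product((0, 1), repeat=n) if sum(zs) == y
    )
    return joint_prob(xs, p) / marginal


n = 3
for p in (0.2, 0.5, 0.9):
    for xs in product((0, 1), repeat=n):
        # The conditional probability is 1 / C(n, y) whatever p is.
        assert isclose(conditional_prob(xs, p), 1 / comb(n, sum(xs)))

cond_low_p = conditional_prob((1, 0, 1), 0.2)
cond_high_p = conditional_prob((1, 0, 1), 0.9)
```

For the sample $(1, 0, 1)$ both calls give $1/\binom{3}{2} = 1/3$ , whether $p = 0.2$ or $p = 0.9$ .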

3. Factorization Theorem

3.1 We need an easier method to identify sufficiency:

While the definition of sufficiency may make sense intuitively, it is not always all that easy to find the conditional distribution of $X_1, X_2, ..., X_n$ given $Y$ . Not to mention that we’d have to find the conditional distribution of $X_1, X_2, ..., X_n$ given $Y$ for every $Y$ that we’d want to consider a possible sufficient statistic! Therefore, using the formal definition of sufficiency as a way of identifying a sufficient statistic for a parameter $\theta$ can often be a daunting road to follow. Thankfully, a theorem often referred to as the Factorization Theorem provides an easier alternative!

3.2 Factorization Theorem:

Let $X_1, X_2, ..., X_n$ denote random variables with joint probability density function or joint probability mass function $f(x_1, x_2, ..., x_n \mid \theta)$ , which depends on the parameter $\theta$ . Then, the statistic $Y=u(X_1, X_2, ..., X_n)$ is sufficient for $\theta$ if and only if the p.d.f. (or p.m.f.) can be factored into two components, that is:

$f(x_1,x_2,...,x_n \mid \theta) = \phi [u(x_1,x_2,...,x_n) \mid \theta] \cdot h(x_1,x_2,...,x_n)$
where:
- $\phi$ is a function that depends on the data $x_1,x_2,...,x_n$ only through the function $y = u(x_1,x_2,...,x_n)$ , and
- the function $h(x_1,x_2,...,x_n)$ does not depend on the parameter $\theta$ .

3.3 Example 2 - Poisson Distribution:

Recall that the mathematical constant $e$ is the unique real number such that the value of the derivative (slope of the tangent line) of the function $f(x) = e^x$ at the point $x = 0$ is equal to $1$ . It turns out that the constant is irrational, but to five decimal places, it equals $e = 2.71828$ . Also, note that there are (theoretically) an infinite number of possible Poisson distributions. Any specific Poisson distribution depends on the parameter $\lambda$ .

Let $X_1, X_2, ..., X_n$ denote a random sample from a Poisson distribution with parameter $\lambda > 0$ . Find a sufficient statistic for the parameter $\lambda$ .

Solution:

Because $X_1, X_2, ..., X_n$ is a random sample, the joint probability mass function of $X_1, X_2, ..., X_n$ is, by independence:

$\begin{split} f(x_1,x_2,...,x_n \mid \lambda) &= f(x_1 \mid \lambda) \cdot f(x_2\mid \lambda) \cdot \dots \cdot f(x_n \mid \lambda) \\ &= \frac{e^{-\lambda}\lambda^{x_1}}{x_1!} \cdot \frac{e^{-\lambda}\lambda^{x_2}}{x_2!} \cdot \dots \cdot \frac{e^{-\lambda}\lambda^{x_n}}{x_n!} \\ &= \left ( e^{-n \lambda} \lambda ^{\sum x_i} \right ) \cdot \left ( \frac{1}{x_1!x_2!\dots x_n!} \right )\\ \end{split}$

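Hey, look at that! We just factored the joint p.m.f. into two functions, one ( $\phi$ ) depending on the data only through the statistic $Y=\sum_{i=1}^n X_i$ and the other ( $h$ ) not depending on the parameter $\lambda$ :

$\phi\left(\sum x_i \mid \lambda\right) = e^{-n \lambda} \lambda ^{\sum x_i}, \qquad h(x_1,x_2,...,x_n) = \frac{1}{x_1!x_2!\dots x_n!}$

Therefore, the Factorization Theorem tells us that $Y=\sum_{i=1}^n X_i$ is a sufficient statistic for $\lambda$ .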

We can also write the joint p.m.f. as:

$f(x_1,x_2,...,x_n \mid \lambda) = \left ( e^{-n \lambda} \lambda ^{n\bar{x}} \right ) \cdot \left ( \frac{1}{x_1!x_2!\dots x_n!} \right )$

Therefore, the Factorization Theorem tells us that $Y=\bar{X}$ is also a sufficient statistic for $\lambda$ .

If you think about it, it makes sense that $Y=\bar{X}$ and $Y=\sum_{i=1}^n X_i$ are both sufficient statistics, because if we know $Y=\bar{X}$ , we can easily find $Y=\sum_{i=1}^n X_i$ , and vice versa.

Conclusion 2:

There can be more than one sufficient statistic for a parameter $\theta$ . In general, if $Y$ is a sufficient statistic for a parameter $\theta$ , then every one-to-one function of $Y$ not involving $\theta$ is also a sufficient statistic for $\theta$ .
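One practical consequence of the factorization is that two samples with the same value of the sufficient statistic have a likelihood ratio that is free of the parameter, because the $\phi$ factors cancel and only $h$ remains. Here is a small sketch for the Poisson case; the two samples and the function name `poisson_joint_pmf` are illustrative choices, not taken from the lesson.

```python
from math import exp, factorial


def poisson_joint_pmf(xs, lam):
    """Joint p.m.f. of an i.i.d. Poisson(lam) sample, by independence."""
    prob = 1.0
    for x in xs:
        prob *= exp(-lam) * lam**x / factorial(x)
    return prob


sample_a = (0, 2, 4)  # sum = 6
sample_b = (1, 2, 3)  # sum = 6, same sufficient statistic

# The phi factors cancel, leaving h(a) / h(b) = (1! 2! 3!) / (0! 2! 4!) = 0.25.
ratios = [
    poisson_joint_pmf(sample_a, lam) / poisson_joint_pmf(sample_b, lam)
    for lam in (0.5, 1.0, 2.0, 5.0)
]
```

Every entry of `ratios` is $0.25$ up to floating-point rounding, so no value of $\lambda$ is favored by one sample over the other.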

3.4 Example 3 - Gaussian Distribution $N\left(\mu, 1\right)$ :

Let $X_1, X_2, ..., X_n$ be a random sample from a normal distribution with mean $\mu$ and variance $\sigma^2 = 1$ . Find a sufficient statistic for the parameter $\mu$ .

Solution:

For i.i.d. data $X_1, X_2, ..., X_n$ , the joint probability density function of $X_1, X_2, ..., X_n$ is

$\begin{split} f(x_1,x_2,...,x_n \mid \mu) &= f(x_1 \mid \mu) \times f(x_2 \mid \mu) \times \dots \times f(x_n \mid \mu) \\ &= \frac{1}{(2\pi)^{1/2}} \exp\left [-\frac{1}{2}(x_1 - \mu)^2 \right ] \times \frac{1}{(2\pi)^{1/2}} \exp\left [-\frac{1}{2}(x_2 - \mu)^2 \right ] \times \dots \times \frac{1}{(2\pi)^{1/2}} \exp\left [-\frac{1}{2}(x_n - \mu)^2 \right ] \\ &= \frac{1}{(2\pi)^{n/2}} \exp\left [-\frac{1}{2} \sum_{i = 1}^{n}(x_i - \mu)^2 \right ] \end{split}$

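A trick that makes factoring the joint p.d.f. easier is to add $0$ , in the form $-\bar{x} + \bar{x}$ , to the quantity in parentheses in the summation. That is:

$f(x_1, x_2, ... , x_n;\mu) = \frac{1}{(2\pi)^{n/2}} \exp \left[ -\frac{1}{2}\sum_{i=1}^{n}\left[ (x_i - \bar{x}) + (\bar{x}-\mu)\right]^2 \right]$

Now, squaring the quantity in brackets, we get: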

$f(x_1, x_2, ... , x_n;\mu) = \frac{1}{(2\pi)^{n/2}} \exp \left[ -\frac{1}{2}\sum_{i=1}^{n}\left[ (x_i - \bar{x})^2 +2(x_i - \bar{x}) (\bar{x}-\mu)+ (\bar{x}-\mu)^2\right] \right]$

And then distributing the summation, we get:

$f(x_1, x_2, ... , x_n;\mu) = \frac{1}{(2\pi)^{n/2}} \exp \left[ -\frac{1}{2}\sum_{i=1}^{n} (x_i - \bar{x})^2 - (\bar{x}-\mu) \sum_{i=1}^{n}(x_i - \bar{x}) -\frac{1}{2}\sum_{i=1}^{n}(\bar{x}-\mu)^2\right]$

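But the middle term in the exponent is $0$ , and the last term, because it doesn’t depend on the index $i$ , can be added up $n$ times:

$\sum_{i=1}^{n}(x_i - \bar{x}) = 0 \quad \text{and} \quad \sum_{i=1}^{n}(\bar{x}-\mu)^2 = n(\bar{x}-\mu)^2$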

So, simplifying, we get:

$f(x_1, x_2, ... , x_n;\mu) = \left\{ \exp \left[ -\frac{n}{2} (\bar{x}-\mu)^2 \right] \right\} \times \left\{ \frac{1}{(2\pi)^{n/2}} \exp \left[ -\frac{1}{2}\sum_{i=1}^{n} (x_i - \bar{x})^2 \right] \right\}$

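In summary, we have factored the joint p.d.f. into two functions, one ( $\phi$ ) depending on the data only through the statistic $Y=\bar{X}$ and the other ( $h$ ) not depending on the parameter $\mu$ :

$\phi(\bar{x} \mid \mu) = \exp \left[ -\frac{n}{2} (\bar{x}-\mu)^2 \right], \qquad h(x_1, x_2, ... , x_n) = \frac{1}{(2\pi)^{n/2}} \exp \left[ -\frac{1}{2}\sum_{i=1}^{n} (x_i - \bar{x})^2 \right]$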

Conclusion 3:

Therefore, the Factorization Theorem tells us that $Y=\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i$ is a sufficient statistic for $\mu$ .
Now, $Y=\bar{X}^3$ is also sufficient for $\mu$ , because if we are given the value of $\bar{X}^3$ , we can easily get the value of $\bar{X}$ through the one-to-one function $w=y^{1/3}$ , that is $W=(\bar{X}^3)^{1/3}=\bar{X}$ .
However, $Y=\bar{X}^2$ is not a sufficient statistic for $\mu$ , because squaring is not a one-to-one function: both $+\bar{X}$ and $-\bar{X}$ are mapped to $\bar{X}^2$ , so the sign of $\bar{X}$ , which carries information about $\mu$ , cannot be recovered.
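The factorization for $N(\mu, 1)$ , and the reason $\bar{X}^2$ falls short, can both be checked numerically. The sketch below is only an illustration: the helper names `joint_pdf`, `phi` and `h`, the sample, and the idea of comparing a sample with its mirror image are assumptions added here, not part of the original lesson.

```python
from math import exp, isclose, pi


def joint_pdf(xs, mu):
    """Joint p.d.f. of an i.i.d. N(mu, 1) sample."""
    n = len(xs)
    return (2 * pi) ** (-n / 2) * exp(-0.5 * sum((x - mu) ** 2 for x in xs))


def phi(xbar, n, mu):
    """Factor that depends on the data only through the sample mean."""
    return exp(-0.5 * n * (xbar - mu) ** 2)


def h(xs):
    """Factor that does not involve mu."""
    n = len(xs)
    xbar = sum(xs) / n
    return (2 * pi) ** (-n / 2) * exp(-0.5 * sum((x - xbar) ** 2 for x in xs))


xs = (1.0, 3.0, 5.0)
n = len(xs)
xbar = sum(xs) / n
factorization_holds = all(
    isclose(joint_pdf(xs, mu), phi(xbar, n, mu) * h(xs)) for mu in (-1.0, 0.0, 2.5)
)

# Squaring loses the sign: xs and its mirror image share xbar**2 but not xbar,
# and their likelihood ratio changes with mu.
mirrored = tuple(-x for x in xs)
ratio_at_0 = joint_pdf(xs, 0.0) / joint_pdf(mirrored, 0.0)
ratio_at_1 = joint_pdf(xs, 1.0) / joint_pdf(mirrored, 1.0)
```

Here `ratio_at_0` is $1$ while `ratio_at_1` is $e^{18}$ , so two samples with the same $\bar{x}^2$ can still favor very different values of $\mu$ .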

3.5 Example 4 - Exponential Distribution:

Let $X_1, X_2, ..., X_n$ be a random sample from an exponential distribution with parameter $\theta$ . Find a sufficient statistic for the parameter $\theta$ .

Solution:

The joint probability density function of $X_1, X_2, ..., X_n$ is, by independence:

$f(x_1, x_2, ... , x_n;\theta) = f(x_1;\theta) \times f(x_2;\theta) \times ... \times f(x_n;\theta)$

Substituting the exponential p.d.f. $f(x;\theta) = \frac{1}{\theta}\exp\left( \frac{-x}{\theta}\right)$ for $x \geq 0$ , the joint p.d.f. is:

$f(x_1, x_2, ... , x_n;\theta) =\frac{1}{\theta}\exp\left( \frac{-x_1}{\theta}\right) \times \frac{1}{\theta}\exp\left( \frac{-x_2}{\theta}\right) \times ... \times \frac{1}{\theta}\exp\left( \frac{-x_n}{\theta} \right)$

Now, simplifying by multiplying the $n$ factors of $\frac{1}{\theta}$ and adding up the $n$ $x_i$ ’s in the exponents, we get:

$f(x_1, x_2, ... , x_n;\theta) =\frac{1}{\theta^n}\exp\left( - \frac{1}{\theta} \sum_{i=1}^{n} x_i\right)$

We have again factored the joint p.d.f. into two functions, one ( $\phi$ ) depending on the data only through the statistic $Y=\sum_{i=1}^n X_i$ and the other ( $h = 1$ ) not depending on the parameter $\theta$ :

$\phi\left(\sum x_i ; \theta\right) = \frac{1}{\theta^n}\exp\left( - \frac{1}{\theta} \sum_{i=1}^{n} x_i\right), \qquad h(x_1, x_2, ... , x_n) = 1$

Conclusion 4:

Therefore, the Factorization Theorem tells us that $Y=\sum_{i=1}^n X_i$ is a sufficient statistic for $\theta$ . And, since $Y=\bar{X}$ is a one-to-one function of $Y=\sum_{i=1}^n X_i$ , it implies that $Y=\bar{X}$ is also a sufficient statistic for $\theta$ .

4. Exponential Form

4.1 Exponential Form

You might not have noticed that in all of the examples we have considered so far in this lesson, every p.d.f. or p.m.f. could be written in what is often called exponential form, that is:
$f( x \mid \theta ) = \exp[K(x)p(\theta)+S(x)+q(\theta)]$

1) Exponential Form of Bernoulli Distribution:

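For example, the p.m.f. of a Bernoulli random variable can be written in exponential form as:

$f(x \mid p) = p^x(1-p)^{1-x} = \exp\left[ x\ln\left(\frac{p}{1-p}\right) + \ln(1) + \ln(1-p) \right]$

with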

(1) $K(x) =x$ and $S(x) = \ln(1)$ being functions only of $x$ ,

(2) $p(p) = \ln\frac{p}{1-p}$ and $q(p) = \ln(1-p)$ being functions only of the parameter $p$ , and

(3) the support $x = 0, 1$ not depending on the parameter $p$ .

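2) Exponential Form of Poisson Distribution:

$f(x \mid \lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!} = \exp\left[ x\ln\lambda - \ln(x!) - \lambda \right]$

with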

(1) $K(x) = x$ and $S(x)= -\ln(x!)$ being functions only of $x$ ,

(2) $p(\lambda) = \ln \lambda$ and $q(\lambda) = -\lambda$ being functions only of the parameter $\lambda$ , and
(3) the support $x = 0, 1, 2, \dots$ not depending on the parameter $\lambda$ .

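3) Exponential Form of Gaussian Distribution $N(\mu, 1)$ :

$f(x \mid \mu) = \frac{1}{(2\pi)^{1/2}} \exp\left[ -\frac{1}{2}(x-\mu)^2 \right] = \exp\left[ x\mu - \frac{x^2}{2} - \frac{\mu^2}{2} - \frac{1}{2}\ln(2\pi) \right]$

with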

(1) $K(x) = x$ and $S(x) = -\frac{x^2}{2}$ being functions only of $x$ ,

(2) $p(\mu) = \mu$ and $q(\mu) = -\frac{\mu^2}{2} - \frac{1}{2}\ln(2\pi)$ being functions only of the parameter $\mu$ , and
(3) the support $-\infty < x < \infty$ not depending on the parameter $\mu$ .

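4) Exponential Form of Exponential Distribution:

$f(x \mid \theta) = \frac{1}{\theta}\exp\left( -\frac{x}{\theta} \right) = \exp\left[ -x \cdot \frac{1}{\theta} + \ln(1) - \ln\theta \right]$

with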

(1) $K(x) = -x$ and $S(x)= \ln(1) = 0$ being functions only of $x$ ,

(2) $p(\theta)= \frac{1}{\theta}$ and $q(\theta)= -\ln\theta$ being functions only of the parameter $\theta$ , and
(3) the support $x \geq 0$ not depending on the parameter $\theta$ .
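The four decompositions listed above can be sanity-checked by plugging each choice of $K$ , $p$ , $S$ and $q$ into the exponential form and comparing with the usual p.m.f. or p.d.f. This is a rough sketch: the helper `exp_form` and the test points are invented for illustration.

```python
from math import exp, factorial, isclose, log, pi, sqrt


def exp_form(K, p, S, q):
    """Build f(x | theta) = exp[K(x) p(theta) + S(x) + q(theta)]."""
    return lambda x, theta: exp(K(x) * p(theta) + S(x) + q(theta))


bernoulli = exp_form(
    lambda x: x, lambda p: log(p / (1 - p)), lambda x: 0.0, lambda p: log(1 - p)
)
poisson = exp_form(
    lambda x: x, lambda lam: log(lam), lambda x: -log(factorial(x)), lambda lam: -lam
)
gaussian = exp_form(
    lambda x: x,
    lambda mu: mu,
    lambda x: -x * x / 2,
    lambda mu: -mu * mu / 2 - 0.5 * log(2 * pi),
)
exponential = exp_form(
    lambda x: -x, lambda th: 1 / th, lambda x: 0.0, lambda th: -log(th)
)

# Each pair is (exponential-form value, value from the textbook formula).
checks = [
    (bernoulli(1, 0.3), 0.3),
    (bernoulli(0, 0.3), 0.7),
    (poisson(4, 2.5), exp(-2.5) * 2.5**4 / factorial(4)),
    (gaussian(1.2, 0.4), exp(-((1.2 - 0.4) ** 2) / 2) / sqrt(2 * pi)),
    (exponential(1.5, 2.0), exp(-1.5 / 2.0) / 2.0),
]
all_match = all(isclose(a, b) for a, b in checks)
```

In all four cases $K(x)$ is $x$ or $-x$ , which is why $\sum X_i$ keeps turning up as the sufficient statistic in the earlier examples.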

4.2 Exponential Criterion

It turns out that writing p.d.f.s and p.m.f.s in exponential form provides us yet a third way of identifying sufficient statistics for our parameters. The following theorem tells us how.

Theorem:

Let $X_1, X_2, ..., X_n$ be a random sample from a distribution with a p.d.f. or p.m.f. of the exponential form:

$f( x \mid \theta ) = \exp[K(x)p(\theta)+S(x)+q(\theta)]$
with a support that does not depend on $\theta$ , that is,
- (1) $K(x)$ and $S(x)$ being functions only of $x$ ,
- (2) $p(\theta)$ and $q(\theta)$ being functions only of the parameter $\theta$ , and
- (3) the support being free of the parameter $\theta$ .

Then, the statistic:
$\sum_{i=1}^n K(X_i)$ is sufficient for $\theta$ .

Proof:

$f(x_1, x_2, ... , x_n;\theta)= f(x_1;\theta) \times f(x_2;\theta) \times ... \times f(x_n;\theta)$

$f(x_1, ... , x_n;\theta)=\exp\left[K(x_1)p(\theta) + S(x_1)+q(\theta)\right] \times ... \times \exp\left[K(x_n)p(\theta) + S(x_n)+q(\theta)\right]$

Collecting like terms in the exponents, we get:

$f(x_1, ... , x_n;\theta)=\exp\left[p(\theta)\sum_{i=1}^{n}K(x_i) + \sum_{i=1}^{n}S(x_i) + nq(\theta)\right]$

which can be factored as:

$f(x_1, ... , x_n;\theta)=\left\{ \exp\left[p(\theta)\sum_{i=1}^{n}K(x_i) + nq(\theta)\right]\right\} \times \left\{ \exp\left[\sum_{i=1}^{n}S(x_i)\right] \right\}$
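We have factored the joint p.m.f. or p.d.f. into two functions:
- one ( $\phi$ ), namely $\exp\left[p(\theta)\sum_{i=1}^{n}K(x_i) + nq(\theta)\right]$ , depending on the data only through the statistic $Y=\sum_{i=1}^n K(X_i)$ and
- the other ( $h$ ), namely $\exp\left[\sum_{i=1}^{n}S(x_i)\right]$ , not depending on the parameter $\theta$ .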

Therefore, the Factorization Theorem tells us that $Y=\sum_{i=1}^n K(X_i)$ is a sufficient statistic for $\theta$ .
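The "collecting like terms" step of the proof can be mirrored in code by working with log-probabilities. The sketch below uses the Poisson distribution as the test case; the function names and the sample are illustrative assumptions.

```python
from math import factorial, isclose, log


def log_joint_direct(xs, lam):
    """Sum of the individual Poisson log-p.m.f. values (independence)."""
    return sum(-lam + x * log(lam) - log(factorial(x)) for x in xs)


def log_joint_collected(xs, theta, K, p, S, q):
    """p(theta) * sum K(x_i) + sum S(x_i) + n * q(theta), as in the proof."""
    n = len(xs)
    statistic = sum(K(x) for x in xs)  # the sufficient statistic
    return p(theta) * statistic + sum(S(x) for x in xs) + n * q(theta)


xs = (3, 0, 2, 5)
lam = 1.7
direct = log_joint_direct(xs, lam)
collected = log_joint_collected(
    xs,
    lam,
    K=lambda x: x,
    p=log,
    S=lambda x: -log(factorial(x)),
    q=lambda t: -t,
)
terms_agree = isclose(direct, collected)
```

In `log_joint_collected` the parameter meets the data only in the product `p(theta) * statistic`, which is the content of the theorem.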

4.3 Example 5 - Geometric Distribution:

Let $X_1, X_2, ..., X_n$ be a random sample from a geometric distribution with parameter $p$ . Find a sufficient statistic for the parameter $p$ .

Solution:

The probability mass function of a geometric random variable is:

$f(x; p) = (1-p)^{x-1}p$

for $x = 1, 2, 3, \dots$ The p.m.f. can be written in exponential form as

$f(x;p) = \exp\left[ x\log(1-p)+\log(1)+\log\left( \frac{p}{1-p} \right)\right]$

so that $K(x) = x$ , and the support $x = 1, 2, 3, \dots$ does not depend on $p$ .

Conclusion 5:

Therefore, $Y=\sum_{i=1}^n X_i$ is sufficient for $p$ . Easy as pie!
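As a cross-check against the original definition of sufficiency, the conditional distribution of a geometric sample given its sum can be computed by enumeration. This is a sketch under stated assumptions: the function names and the sample $(2, 1, 3)$ are made up, and the value $1/\binom{y-1}{n-1}$ comes from counting the ways to write $y$ as an ordered sum of $n$ positive integers.

```python
from itertools import product
from math import comb, isclose


def geometric_joint_pmf(xs, p):
    """Joint p.m.f. of an i.i.d. geometric(p) sample on x = 1, 2, 3, ..."""
    prob = 1.0
    for x in xs:
        prob *= (1 - p) ** (x - 1) * p
    return prob


def conditional_given_sum(xs, p):
    """P(X_1 = x_1, ..., X_n = x_n | sum of X_i = sum(xs)) by enumeration."""
    n, y = len(xs), sum(xs)
    # Every x_i is at least 1, so no single x_i can exceed y - (n - 1).
    support = [
        zs for zs in product(range(1, y - n + 2), repeat=n) if sum(zs) == y
    ]
    marginal = sum(geometric_joint_pmf(zs, p) for zs in support)
    return geometric_joint_pmf(xs, p) / marginal


xs = (2, 1, 3)
cond_small_p = conditional_given_sum(xs, 0.3)
cond_large_p = conditional_given_sum(xs, 0.8)
expected = 1 / comb(sum(xs) - 1, len(xs) - 1)  # 1 / C(5, 2)
same_for_both = isclose(cond_small_p, expected) and isclose(cond_large_p, expected)
```

Both conditional probabilities equal $1/10$ , independent of $p$ , in agreement with the Exponential Criterion.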

5. Two or More Parameters

What happens if a probability distribution has two parameters, $\theta_1$ and $\theta_2$ , say, for which we want to find sufficient statistics, $Y_1$ and $Y_2$ ? Fortunately, the definitions of sufficiency can easily be extended to accommodate two (or more) parameters. Let’s start by extending the Factorization Theorem.

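5.1 Factorization Theorem

Let $X_1, X_2, ..., X_n$ denote random variables with joint p.d.f. or p.m.f. $f(x_1, x_2, ..., x_n; \theta_1, \theta_2)$ , which depends on the parameters $\theta_1$ and $\theta_2$ . Then, the statistics $Y_1=u_1(X_1, X_2, ..., X_n)$ and $Y_2=u_2(X_1, X_2, ..., X_n)$ are joint sufficient statistics for $\theta_1$ and $\theta_2$ if and only if:

$f(x_1, x_2, ..., x_n; \theta_1, \theta_2) = \phi [u_1(x_1, ..., x_n), u_2(x_1, ..., x_n); \theta_1, \theta_2] \cdot h(x_1, ..., x_n)$

where $\phi$ depends on the data only through $u_1$ and $u_2$ , and $h$ does not depend on $\theta_1$ or $\theta_2$ .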

5.2 Example 6 - Gaussian Distribution $N\left( \mu, \sigma^2 \right)$ :

Let $X_1, X_2, ..., X_n$ denote a random sample from a normal distribution $N(\theta_1, \theta_2)$ . That is, $\theta_1$ denotes the mean $\mu$ and $\theta_2$ denotes the variance $\sigma^2$ . Use the Factorization Theorem to find joint sufficient statistics for $\theta_1$ and $\theta_2$ .

Solution:

The joint probability density function of $X_1, X_2, ..., X_n$ is, by independence:

$f(x_1, x_2, ... , x_n;\theta_1, \theta_2) = f(x_1;\theta_1, \theta_2) \times f(x_2;\theta_1, \theta_2) \times ... \times f(x_n;\theta_1, \theta_2)$

Using the Gaussian p.d.f.

$f(x_i \mid \theta_1, \theta_2) = \frac{1}{(2\pi \theta_2)^{1/2}} \exp\left [-\frac{1}{2}\frac{(x_i - \theta_1)^2}{\theta_2} \right ]$

we get

$f(x_1, x_2, ... , x_n;\theta_1, \theta_2) = \left(\frac{1}{\sqrt{2\pi\theta_2}}\right)^n \exp \left[-\frac{1}{2}\frac{\sum_{i=1}^{n}(x_i-\theta_1)^2}{\theta_2} \right]$

Rewriting the first factor, and squaring the quantity in parentheses and distributing the summation in the second factor, we get:

$f(x_1, x_2, ... , x_n;\theta_1, \theta_2) = \exp \left[\log\left[\left(\frac{1}{\sqrt{2\pi\theta_2}}\right)^n\right]\right] \exp \left[-\frac{1}{2\theta_2}\left\{ \sum_{i=1}^{n}x_{i}^{2} -2\theta_1\sum_{i=1}^{n}x_{i} +\sum_{i=1}^{n}\theta_{1}^{2} \right\}\right]$

Simplifying yet more, we get:

$f(x_1, x_2, ... , x_n;\theta_1, \theta_2) = \exp \left[ -\frac{1}{2\theta_2}\sum_{i=1}^{n}x_{i}^{2}+\frac{\theta_1}{\theta_2}\sum_{i=1}^{n}x_{i} -\frac{n\theta_{1}^{2}}{2\theta_2}-n\log\sqrt{2\pi\theta_2} \right]$

Look at that! We have factored the joint p.d.f. into two functions, one ( $\phi$ ) depending on the data only through the statistics $Y_1=\sum_{i=1}^n X_i^2$ and $Y_2=\sum_{i=1}^n X_i$ , and the other ( $h = 1$ ) not depending on the parameters $\theta_1$ and $\theta_2$ .

Conclusion 6.1:

Therefore, the Factorization Theorem tells us that $Y_1=\sum_{i=1}^n X_i^2$ and $Y_2=\sum_{i=1}^n X_i$ are joint sufficient statistics for $\theta_1$ and $\theta_2$ .
And, the one-to-one functions of $Y_1$ and $Y_2$ , namely:

$\bar{X} =\frac{Y_2}{n}=\frac{1}{n}\sum_{i=1}^{n}X_i$

and

$S^2=\frac{Y_1-(Y_{2}^{2}/n)}{n-1}=\frac{1}{n-1} \left[\sum_{i=1}^{n}X_{i}^{2}-n\bar{X}^2 \right]$

are also joint sufficient statistics for $\theta_1$ and $\theta_2$ .
We have just shown that the intuitive estimators of $\mu$ and $\sigma^2$ are also sufficient estimators. That is, the data contain no more information than the estimators $\bar{X}$ and $S^2$ do about the parameters $\mu$ and $\sigma^2$ . That seems like a good thing!
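The mapping from $(Y_1, Y_2)$ to $(\bar{X}, S^2)$ is easy to try out. The sketch below recovers the sample mean and sample variance from the two sums alone and compares them with Python's `statistics.mean` and `statistics.variance`; the data values and the function name `mean_and_variance_from_sums` are made up for illustration.

```python
from math import isclose
from statistics import mean, variance


def mean_and_variance_from_sums(y1, y2, n):
    """Recover (xbar, s^2) from y1 = sum of x_i^2 and y2 = sum of x_i."""
    xbar = y2 / n
    s2 = (y1 - y2**2 / n) / (n - 1)
    return xbar, s2


data = [2.0, 4.0, 4.0, 5.0, 7.0]
n = len(data)
y1 = sum(x * x for x in data)  # 110.0
y2 = sum(data)                 # 22.0

# Only y1, y2 and n are used from here on; the raw data are not needed.
xbar, s2 = mean_and_variance_from_sums(y1, y2, n)
matches_library = isclose(xbar, mean(data)) and isclose(s2, variance(data))
```

For this data set $\bar{x} = 4.4$ and $s^2 = 3.3$ . For large samples or values far from zero, this sum-of-squares formula can lose precision to cancellation, so it is shown here for the algebra rather than as a recommended numerical method.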

5.3 Exponential Criterion

We have just extended the Factorization Theorem. Now, the Exponential Criterion can also be extended to accommodate two (or more) parameters. It is stated here without proof.

Exponential Criterion:

Let $X_1, X_2, ..., X_n$ be a random sample from a distribution with a p.d.f. or p.m.f. of the exponential form:

$f(x;\theta_1,\theta_2)=\exp\left[K_1(x)p_1(\theta_1,\theta_2)+K_2(x)p_2(\theta_1,\theta_2)+S(x) +q(\theta_1,\theta_2) \right]$

with a support that does not depend on the parameters $\theta_1$ and $\theta_2$ . Then, the statistics $Y_1=\sum_{i=1}^n K_1(X_i)$ and $Y_2=\sum_{i=1}^n K_2(X_i)$ are jointly sufficient for $\theta_1$ and $\theta_2$ .

5.4 Example 6 - Gaussian Distribution $N\left( \mu, \sigma^2 \right)$ (continued):

Let $X_1, X_2, ..., X_n$ denote a random sample from a normal distribution $N(\theta_1, \theta_2)$ . That is, $\theta_1$ denotes the mean $\mu$ and $\theta_2$ denotes the variance $\sigma^2$ . Use the Exponential Criterion to find joint sufficient statistics for $\theta_1$ and $\theta_2$ .

Solution:

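The probability density function of a normal random variable with mean $\theta_1$ and variance $\theta_2$ can be written in exponential form as:

$f(x;\theta_1,\theta_2) = \exp\left[ -\frac{1}{2\theta_2}x^2 + \frac{\theta_1}{\theta_2}x - \frac{\theta_1^2}{2\theta_2} - \frac{1}{2}\log(2\pi\theta_2) \right]$

with $K_1(x) = x^2$ , $K_2(x) = x$ , and $S(x) = 0$ .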

Conclusion 6.2:

Therefore, the statistics $Y_1=\sum_{i=1}^n X_i^2$ and $Y_2=\sum_{i=1}^n X_i$ are joint sufficient statistics for $\theta_1$ and $\theta_2$ .
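A quick numerical check of the two-parameter exponential form of the normal density is sketched below. The assignments $K_1(x) = x^2$ , $K_2(x) = x$ , $S(x) = 0$ and the corresponding $p_1$ , $p_2$ , $q$ are the ones assumed in this sketch (chosen to match Conclusion 6.2), and the result is compared against `statistics.NormalDist` from the Python standard library.

```python
from math import exp, isclose, log, pi, sqrt
from statistics import NormalDist


def normal_exp_form(x, theta1, theta2):
    """exp[K1(x) p1 + K2(x) p2 + S(x) + q] for N(theta1, theta2)."""
    k1, p1 = x * x, -1 / (2 * theta2)
    k2, p2 = x, theta1 / theta2
    s = 0.0
    q = -(theta1**2) / (2 * theta2) - 0.5 * log(2 * pi * theta2)
    return exp(k1 * p1 + k2 * p2 + s + q)


theta1, theta2 = 0.5, 2.0  # mean and variance
# NormalDist takes the standard deviation, hence the square root.
reference = NormalDist(mu=theta1, sigma=sqrt(theta2))
points = (-1.0, 0.0, 1.3, 4.2)
forms_agree = all(
    isclose(normal_exp_form(x, theta1, theta2), reference.pdf(x)) for x in points
)
```

Since $K_1(x) = x^2$ and $K_2(x) = x$ , the criterion gives the same pair $\sum X_i^2$ and $\sum X_i$ that the Factorization Theorem produced.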

6. Reference

[1]: Lesson 53: Sufficient Statistics (https://onlinecourses.science.psu.edu/stat414/print/book/export/html/244)
