Central Limit Theorem Overview


Mathematical definitions/proofs/derivations are all omitted, because this note is intended to serve as a simple overview of the CLT, its power, and a misconception that I used to have about the CLT.

Random Variable

Definition

  • A random variable, $Y$, is a map (function) from the sample space $S$ to $\mathbb{R}$.
    • The sample space is the set of all possible outcomes (events) resulting from a random experiment.
  • For example, if we care about the test scores of all the first-year students at a university, then we have a sample space consisting of all the integers in, say, $[0, 100]$, and we can naturally define the random variable $Y: S \to \mathbb{R}$ by mapping each possible score in the sample space to the same numerical value in $\mathbb{R}$.

Probability Density/Mass Function & Population

  • For discrete random variables, we can use its probability mass function to describe its probability distribution.
  • For continuous random variables, we can use its probability density function (PDF) to describe its probability distribution.
  • The probability distribution of a random variable is usually unknown in reality, because knowledge of how the values in the population of interest are actually distributed simply cannot be obtained, or cannot be suitably modeled by a closed-form mathematical formula. Consequently, quantities like the population mean $\mu$ (the mean of the R.V.) and the population variance $\sigma^2$ (the variance of the R.V.) remain unknown to us.
  • However, we do have a theorem that allows us to estimate these two parameters of the population from limited information that we can obtain from random samples drawn from the population.
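As a minimal sketch of this estimation idea, the snippet below simulates a hypothetical score population (the Gaussian shape and all the numbers are assumptions purely for illustration) and computes the usual unbiased point estimates from a single random sample:

```python
import random
import statistics

random.seed(0)

# Hypothetical population: exam-like scores; in practice its distribution
# would be unknown to the analyst.
population = [random.gauss(70, 12) for _ in range(100_000)]

# Draw ONE random sample of size n and compute the standard point estimates.
n = 200
sample = random.sample(population, n)

y_bar = statistics.mean(sample)          # unbiased estimator of the population mean mu
s_squared = statistics.variance(sample)  # unbiased estimator of sigma^2 (divides by n - 1)

print(y_bar, s_squared)
```

Note that `statistics.variance` uses the $n-1$ denominator, which is what makes $S^2$ an unbiased estimator of $\sigma^2$.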

Central Limit Theorem

  • Now suppose we take a random sample of size $\underline{n}$ from a population distribution with mean $\mu$ and variance $\sigma^2$. Effectively, this means $n$ random variables, $Y_1, \dots, Y_n$, that are independent and share the same common probability distribution. (The definition of independence of multiple R.V.'s is omitted.)
    • We call $Y_1, \dots, Y_n$ independent and identically distributed (iid), where the distribution of each $Y_i$ follows the population distribution with mean $\mu$ and variance $\sigma^2$, for all $i = 1, \dots, n$.

  • The power of the Central Limit Theorem is that you can always estimate μ \mu μ with the limited information from your sample(s) no matter how weird that unknown population probability distribution might be.

  • The Central Limit Theorem states that the sum of these $n$ independent and identically distributed random variables, denoted $X = \sum_{i=1}^n Y_i$, where each $Y_i$ has mean $\mu$ and variance $\sigma^2$, approximately follows a normal distribution when $n$ is large enough.

  • Notationally, it says the following:

    • $X = \sum_{i=1}^n Y_i$ approximately follows $N(n\mu,\ n\sigma^2)$ as $n \to \infty$.
    • Or equivalently, $Z_n = \frac{X - n\mu}{\sqrt{n\sigma^2}}$ converges in distribution to $N(0, 1)$ as $n \to \infty$.
  • Remarks on the statements of the Theorem:

    • The random variable $Z_n$ is a standardization of the random variable $X$, i.e., it is obtained by subtracting the mean of $X$ from $X$ and then dividing the result by the standard deviation of $X$. Indeed, $N(0,1)$, the standard normal, is also called the $Z$-distribution.
    • The sample mean (another random variable), $\bar X = X/n$, i.e., the sampling distribution of the sample mean, follows a normal distribution $N(\mu, \frac{\sigma^2}{n})$ by the CLT. This is obtained from $X \sim N(n\mu, n\sigma^2)$, since $E(X/n) = \frac{E(X)}{n}$ and $Var(X/n) = \frac{Var(X)}{n^2}$.
    • The distribution of $X$ is the sampling distribution of the sample statistic "sample sum" (or the sample mean, if you use $\bar X := X/n$) when the sample size is $n$. Thus, the CLT also means: when randomly drawing samples of size $n$ from any population with mean $\mu$ and standard deviation $\sigma$, the sampling distribution of the sample sum follows $N(n\mu, n\sigma^2)$ when $n$ is large enough. Recall that $X$ is a random variable, so its exact probability distribution is never known to us in reality: to obtain the exact probability density function of $X$, we would have to take infinitely many random samples of size $n$, which is impossible.
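The statements above can be checked numerically. The sketch below draws repeated samples from a deliberately skewed population (an exponential distribution, chosen here as an assumption so that $\mu = \sigma^2 = 1$) and confirms that the sample means cluster around $N(\mu, \frac{\sigma^2}{n})$:

```python
import random
import statistics

random.seed(1)

# Skewed population: exponential with rate 1, so mu = 1 and sigma^2 = 1.
n = 100        # sample size
reps = 5_000   # number of samples drawn -- only to visualize the theory

# For each repetition, draw a sample of size n and record its mean.
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

# CLT prediction: sample mean is approximately N(mu, sigma^2 / n) = N(1, 0.01).
print(statistics.mean(means))      # ~ 1.0
print(statistics.variance(means))  # ~ 0.01
```

Despite the strongly right-skewed population, the collection of sample means is centered at $\mu$ with variance close to $\sigma^2/n$, as the theorem predicts.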

Caveat 1 - How many samples do we take?

  • The CLT has nothing to do with how many random samples of size $n$ you actually take from the population! The distribution depends only on the sample size and the parameters of the population.
  • The reason is that, by the CLT's statement, the distribution of $X$ is already determined (theoretically, as the population may be unknown) once the sample size $n$ is decided, so how many samples you actually take does not matter.
  • Indeed, to estimate the sampling distribution of the sample mean, only one random sample is needed:
    • Even with only one random sample, the sample statistics $\bar y$ (sample mean) and $S$ (sample standard deviation) we can obtain are enough to give unbiased estimates of the population parameters $\mu$ and $\sigma^2$ ($\mu$ is estimated by the estimator $\bar y$, and $\sigma^2$ is estimated by $S^2$).
    • Thus, using the data from just one random sample, the CLT allows us to determine the approximate distribution of $X$ (or $X/n$).
    • More precisely, as $X$ refers to the sum of the $Y$ values in a random sample, the CLT is telling us the sampling distribution of the sample mean has a mean of approximately $\bar y$ and a variance of approximately $\frac{S^2}{n}$ (i.e., we can substitute $\mu$ in the original formula with $\bar y$ and $\sigma$ with $S$, as these are unbiased estimates).
    • Moreover, this estimate of the variance of the sampling distribution of the sample mean can in turn be used to calculate a confidence interval, as the idea of a confidence interval also relies on the sampling distribution of the sample mean.
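A minimal sketch of this one-sample workflow, where a made-up uniform population stands in for the unknown one (all constants here are assumptions for illustration):

```python
import math
import random
import statistics

random.seed(2)

# ONE random sample from an "unknown" population (simulated as uniform on
# [0, 10], so the true mean is 5 -- the analyst never sees this).
n = 400
sample = [random.uniform(0, 10) for _ in range(n)]

y_bar = statistics.mean(sample)   # estimates mu
s = statistics.stdev(sample)      # estimates sigma
se = s / math.sqrt(n)             # estimated SD of the sampling distribution of the mean

# Approximate 95% confidence interval for mu via the CLT (z = 1.96).
ci = (y_bar - 1.96 * se, y_bar + 1.96 * se)
print(ci)
```

Everything here comes from the single sample: $\bar y$, $S$, and hence the estimated standard error $S/\sqrt{n}$ that the interval is built on.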

Caveat 2 - Limitations of the CLT

  • The CLT only applies to the sampling distributions of sample statistics that can be directly derived from the sample sum, for example, the sample mean and the sample proportion. It fails when we want to consider the sampling distribution of statistics like the sample variance. Therefore, $t$ tests and/or $z$ tests relying on the CLT will not work if we want to test hypotheses about those parameters, and methods like the bootstrap test may help. See the Bootstrap part in this post.
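As an illustration of the bootstrap alternative, here is a percentile-bootstrap sketch for the sample variance (the population, seed, and all constants are assumptions for illustration):

```python
import random
import statistics

random.seed(3)

# One observed sample; we want an interval for the population variance,
# a statistic the CLT-based z/t machinery does not directly cover.
sample = [random.gauss(0, 2) for _ in range(150)]  # true sigma^2 = 4
n = len(sample)

# Nonparametric bootstrap: resample with replacement, recompute the statistic.
reps = 2_000
boot_vars = []
for _ in range(reps):
    resample = [random.choice(sample) for _ in range(n)]
    boot_vars.append(statistics.variance(resample))

# Percentile bootstrap 95% interval for sigma^2.
boot_vars.sort()
lo, hi = boot_vars[int(0.025 * reps)], boot_vars[int(0.975 * reps)]
print(lo, hi)
```

The bootstrap approximates the sampling distribution of the statistic by resampling the data itself, so it needs no normality result for that statistic.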

Computation Example of CLT

  • The lengths of bolts produced in a factory are assumed to be normally distributed with mean 3.06 inches and standard deviation 0.63 inches.
    Suppose a researcher chooses 70 samples of size 50 from this population. Calculate the mean and variance of the sampling distribution.
  • Solution:
    • Notice that we now actually know the population distribution, so $Y$'s probability density function is already known: $N(\mu = 3.06,\ \sigma^2 = 0.63^2)$.
    • Now, with sample size $n = 50$, by the CLT the sample mean, $X/n$, is approximately normally distributed, and we are asked for the mean and variance of the sampling distribution of the mean of $X := \sum_{i=1}^{50} Y_i$. By the CLT, these are given by $\frac{E(X)}{n} = \frac{3.06 \times 50}{50} = 3.06$ and $\frac{Var(X)}{n^2} = \frac{50 \times 0.63^2}{50^2} = \frac{0.63^2}{50} \approx 0.089^2$.
      • You may derive these two formulas using the linearity of expectation, $E(a\,u(X) + b) = a\,E(u(X)) + b$, where $a, b$ are constants, $X$ is any R.V., and $u: image(X) \to \mathbb{R}$ is a given function of $X$ (together with $Var(aX + b) = a^2\,Var(X)$ for the variance).
    • Thus, the distribution of the sample mean is approximately $N(3.06,\ 0.089^2)$.
  • Remark: as per Caveat 1 above, the number $70$ has nothing to do with the sampling distribution of the sample mean!
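The arithmetic in the solution can be reproduced directly, using only the numbers given in the problem:

```python
import math

mu, sigma, n = 3.06, 0.63, 50

mean_of_sample_mean = mu                  # E(X/n) = n*mu / n = mu
se_of_sample_mean = sigma / math.sqrt(n)  # sqrt(n*sigma^2 / n^2) = sigma / sqrt(n)

print(mean_of_sample_mean, round(se_of_sample_mean, 3))  # 3.06 0.089
```

Note that `70`, the number of samples, appears nowhere in the computation.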