

As we saw in the previous article, “Inferential Statistics,” inferential statistics plays a significant role in Data Science. The Central Limit Theorem (CLT), a part of inferential statistics, is one of the techniques data scientists use most often in real-world work.

To know about CLT, first, we need to understand the following topics:


  1. Sample

  2. Sampling Distribution


A Sample is a subset of items (for example, some employees) selected from the whole population (the full employee list).


Let’s say we have a company in which 30,000 employees are working. We want to find out the daily commute time of all the employees. It will be very tedious and time-consuming to go to every employee and note their commute time.

Let’s see if there is another way to complete the task. Say we surveyed 100 random employees in the company. After the survey, we calculated the mean commute time of those employees as 36.6 min. Is a sample of just 100 random employees enough to say that the mean commute time of all the employees is 36.6 min?

No, we cannot say that. The overall mean will be 36.6 ± some error, i.e., if the error is 3 min, then the employees’ overall mean will lie between 36.6 - 3 and 36.6 + 3. Now, how do we find out the error?

To answer that question, first, we have to understand the Sampling Terminology.

  • Total number of items in the population, Population Size = N

  • Mean of the population, Population Mean (μ) = (Σ Xi)/N

  • Variance of the population, Population Variance (σ²) = Σ(Xi - μ)²/N

  • Number of items in the sample, Sample Size = n

  • Mean of the sample, Sample Mean (x¯) = (Σ xi)/n

  • Variance of the sample, Sample Variance (S²) = Σ(xi - x¯)²/(n - 1)

Let’s see how we can find the Sample Mean and Sample Variance from a given sample.


We need to find the average height of people in an area from the following sample data.


Sample heights: 121.92, 133.21, 141.34, 126.23, 175.74

Sample Size (n) = 5

Sample Mean (x¯) = (121.92 + 133.21 + 141.34 + 126.23 + 175.74)/5 = 139.69

Sample Variance (S²) = {(121.92 - 139.69)² + (133.21 - 139.69)² + (141.34 - 139.69)² + (126.23 - 139.69)² + (175.74 - 139.69)²}/4 ≈ 460.3

Sample Standard Deviation (S) = √460.3 ≈ 21.45

That is how we calculate the Sample Mean and Variance with the help of sample data.
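The same arithmetic can be checked in a few lines of Python. This is a minimal sketch that uses only the standard-library `statistics` module and the five height values from the example; the variable names are illustrative.

```python
import statistics

# Heights from the worked example (units as in the source data).
heights = [121.92, 133.21, 141.34, 126.23, 175.74]

n = len(heights)
sample_mean = statistics.mean(heights)          # sum(x) / n
sample_variance = statistics.variance(heights)  # divides by n - 1
sample_std = statistics.stdev(heights)          # square root of the sample variance

print(f"n               = {n}")
print(f"sample mean     = {sample_mean:.2f}")
print(f"sample variance = {sample_variance:.2f}")
print(f"sample std dev  = {sample_std:.2f}")
```

Note that `statistics.variance` and `statistics.stdev` divide by n - 1, matching the sample formulas above, while `statistics.pvariance` and `statistics.pstdev` divide by N and correspond to the population formulas.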


A Sampling Distribution is the probability distribution of a statistic (here, the sample mean) obtained from many samples drawn from the same population.


What it means is, we have 30000 employees in our company. First, we select 50 random employees and calculate their mean, let it be x¯1. After that, we’ll take another 50 random employees from the whole list and calculate the mean, which is x¯2. Let’s say we continued this process and calculated the mean up to x¯100.

So what we have, interestingly enough, is the distribution for sample means.


If we plot the distribution of all the sample means in a graph, it approximates a Normal Distribution.


The Sampling Distribution has some fascinating properties, which ultimately help in finding the error in our estimate of the population mean.


The sampling distribution’s mean is denoted by μₓ¯.


μₓ¯ = (Sum of all the sample means)/(Total number of samples)


There are two important properties of a sampling distribution:


  1. Sampling Distribution Mean(μₓ¯) = Population Mean(μ)

  2. Sampling distribution’s standard deviation (Standard error) = σ/√n, where σ is the population’s standard deviation and n is the sample size

The Central Limit Theorem (CLT) states that, for any population distribution, provided a large number of samples have been taken, the following properties hold:

  1. Sampling Distribution Mean(μₓ¯) = Population Mean(μ)

  2. Sampling distribution’s standard deviation (Standard error) = σ/√n ≈S/√n

  3. For n > 30, the sampling distribution becomes approximately a normal distribution.

Let’s verify the properties of CLT in Python through Jupyter Notebook.

For the following Python code, we’ll use the datasets of Population and Random Values, which we can find here.


First, import necessary libraries into Jupyter Notebook.

We imported all the necessary packages that we use in the code that follows. Since we are going to sample the data randomly, we set a random seed with np.random.seed(42) so that the analysis is reproducible.
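To illustrate what the seed does, here is a minimal sketch that assumes only NumPy. It is not the article's original import cell; it just shows that resetting the seed to 42 before two identical draws produces the same values both times.

```python
import numpy as np

np.random.seed(42)
first_draw = np.random.choice(100, size=5)   # five random integers from 0..99

np.random.seed(42)                           # resetting the seed restarts the sequence
second_draw = np.random.choice(100, size=5)

print(first_draw)
print(second_draw)
print("identical:", np.array_equal(first_draw, second_draw))
```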

Now, let’s read the dataset we are dealing with,


The dataset looks like this,


[Figure: Population Dataset]

Let’s extract the ‘Weight’ column from the dataset and see the distribution of that column.

The Weight column and its distribution graph look like this,


[Figure: distribution plot of the Weight column]

As we can see, the chart is close to a Normal Distribution graph.


Let’s also find out the mean and standard deviation of the weight column through code.


Mean = 220.67326732673268

Std. Dev. = 26.643110470317723

These values are the exact Mean and Standard Deviation values of the Weight Column.


Now, let’s start sampling the data.


First, we’ll take a sample of 30 members from the data. We choose this size because, after repeated sampling, we want to check whether the sampling distribution follows a Normal Distribution.
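A minimal sketch of this step follows. Since the dataset itself is not reproduced here, it uses a synthetic stand-in for the Weight column, generated around the mean and standard deviation reported above; the array `weight`, its size of 1,000, and the use of `np.random.choice` are assumptions, so the printed means will not match the article's numbers.

```python
import numpy as np

np.random.seed(42)

# Synthetic stand-in for the Weight column (assumption, not the real dataset).
weight = np.random.normal(loc=220.67, scale=26.64, size=1000)

samp_size = 30
for run in range(3):
    # Each call draws a fresh random sample (with replacement by default).
    sample = np.random.choice(a=weight, size=samp_size)
    print(f"run {run + 1}: sample mean = {sample.mean():.2f}")

print(f"mean of all values = {weight.mean():.2f}")
```

Each run prints a slightly different sample mean, which is the variability discussed next.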


The mean value for the above sample = 222.1, which is greater than the actual mean of 220.67. Let’s rerun the code,

The mean value for the above sample = 220.5, which is almost equal to the original mean. If we rerun the code, we’ll get the mean value = 221.6

Each time we take a sample, the mean is different. There is variability in the sample mean itself. Let’s move ahead and find out if the sample mean follows a distribution.


Instead of taking one sample mean at a time, we’ll take 1000 such sample means and assign them to a variable.
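One way this could look is sketched below, again on a synthetic stand-in for the Weight column (an assumption) rather than the real data. The names `sample_means` and `samp_size` follow the article; everything else is illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(42)

# Synthetic stand-in for the Weight column (assumption, not the real dataset).
weight = np.random.normal(loc=220.67, scale=26.64, size=1000)

samp_size = 30
sample_means = []
for _ in range(1000):
    sample = np.random.choice(a=weight, size=samp_size)
    sample_means.append(sample.mean())

# A Series gives us .mean() and .std() directly.
sample_means = pd.Series(sample_means)

print("number of samples   :", len(sample_means))
print("mean of sample means:", round(sample_means.mean(), 2))
print("std of sample means :", round(sample_means.std(), 2))
```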


We have converted sample_means into a Series object because a plain list does not provide mean and standard deviation methods.


The total number of samples = 1000

Now we have 1000 samples and their mean values. Let’s plot the distribution graph using seaborn.

The distribution plot looks like this,


[Figure: distribution plot of the 1000 sample means]

As we can observe, the above distribution looks approximately like Normal Distribution.


The other thing we need to check here is the mean and standard deviation of the sample means.


Mean of the sample means = 220.6945, which is almost the same as the original mean of 220.67. Standard deviation of the sample means = 4.641450507418211

Let’s see the relation between the standard deviation of the sample means and the standard deviation of the actual data. When we divide the standard deviation of the original data by the square root of the sample size (26.64/√30), we get 4.86, which is close to sample_means.std().
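The comparison can be sketched in a self-contained way as follows. The population here is synthetic (an assumption), and the sampling is vectorized: each row of `samples` is one sample of size 30, which is an illustrative choice rather than the article's original code.

```python
import numpy as np

np.random.seed(42)

# Synthetic stand-in population (assumption, not the article's dataset).
population = np.random.normal(loc=220.67, scale=26.64, size=10_000)

samp_size = 30
n_samples = 1000

# Each row is one sample of size 30, drawn with replacement.
samples = np.random.choice(population, size=(n_samples, samp_size))
sample_means = samples.mean(axis=1)

observed_se = sample_means.std(ddof=1)                       # spread of the sample means
predicted_se = population.std(ddof=1) / np.sqrt(samp_size)   # sigma / sqrt(n)

print(f"mean of population   : {population.mean():.3f}")
print(f"mean of sample means : {sample_means.mean():.3f}")
print(f"std of sample means  : {observed_se:.3f}")
print(f"sigma / sqrt(n)      : {predicted_se:.3f}")
```

The two standard-deviation figures should agree to within a few percent; the small gap is itself sampling noise and shrinks as `n_samples` grows.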

So, from the above code, we can infer that:


  • Sampling distribution’s mean (μₓ¯) = Population mean (μ)

  • Sampling distribution’s standard deviation (standard error) = σ/√n

So far, the original data in the “Weight” column was itself close to a normal distribution. Let’s see whether the sampling distribution is still normal even when the original data is not normally distributed.

We’ll take another data set that contains some random values and plot the values in a distribution graph.


The dataset and the graph look like this,


[Figure: the Value column and its distribution plot]

As we can see, the Value column does not resemble the Normal Distribution graph. It looks somewhat like an exponential distribution.


Let’s pick samples from this distribution, calculate their means, and plot the sampling distribution.
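A minimal sketch of this experiment is shown below. It assumes an exponentially distributed stand-in for the Value column (the scale of 100 is arbitrary, and the real data is not used), and it replaces the plot with a simple numeric check: for a normal curve, roughly 95% of values lie within 2 standard deviations of the mean.

```python
import numpy as np

np.random.seed(42)

# Synthetic right-skewed stand-in for the Value column (assumption).
values = np.random.exponential(scale=100.0, size=5000)

samp_size = 30
samples = np.random.choice(values, size=(1000, samp_size))  # with replacement
sample_means = samples.mean(axis=1)

observed_se = sample_means.std(ddof=1)
predicted_se = values.std(ddof=1) / np.sqrt(samp_size)

# Share of sample means within 2 standard deviations of their own mean.
within_2_sd = np.mean(np.abs(sample_means - sample_means.mean()) < 2 * observed_se)

print(f"mean of data / sample means : {values.mean():.2f} / {sample_means.mean():.2f}")
print(f"std of sample means         : {observed_se:.2f}")
print(f"sigma / sqrt(n)             : {predicted_se:.2f}")
print(f"share within 2 std devs     : {within_2_sd:.3f}")
```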


Now, the distribution graph for the samples looks like,


[Figure: distribution plot of the sample means drawn from the Value column]

Surprisingly, the distribution of the sample_means we obtained from the Value column, which is itself far from a Normal Distribution, is still very close to a Normal Distribution.


Let’s compare the sample_means Mean value to its parent Mean value.


As we can see, the mean of sample_means and the mean of the original dataset are very similar.

Similarly, the standard deviation of the sample means is sample_means.std() = 13.263962580003142

That value should be quite close to df1.Value.std()/np.sqrt(samp_size) = 14.060457446377631

Let’s compare the distribution graph of each dataset with its corresponding sampling distribution.


[Figure: each dataset’s distribution next to its sampling distribution. Image by Author]

As we can see, irrespective of the original dataset’s distribution, the sampling distribution resembles the Normal Distribution Curve.


There’s only one thing to consider now, i.e., Sample Size. We’ll observe that, as the sample size increases, the sampling distribution will approximate a normal distribution even more closely.

Let’s create different Sizes of samples and plot the corresponding distribution graphs.
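The effect of sample size can also be checked numerically. The sketch below is an illustration under assumptions: it uses a synthetic exponential population instead of the article's dataset, and it measures how far each sampling distribution is from symmetric with a small hand-written `skewness` helper (0 means symmetric, as a normal curve is) rather than plotting.

```python
import numpy as np

np.random.seed(42)

# Synthetic right-skewed population (assumption, not the article's data).
values = np.random.exponential(scale=100.0, size=5000)

def skewness(x):
    """Moment-based skewness: 0 for a symmetric distribution."""
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

results = {}
for samp_size in [3, 10, 30, 50, 100, 200]:
    # 2000 samples of the given size, one per row.
    means = np.random.choice(values, size=(2000, samp_size)).mean(axis=1)
    results[samp_size] = (means.std(), skewness(means))
    print(f"n = {samp_size:>3}: std of means = {means.std():6.2f}, "
          f"skewness = {skewness(means):5.2f}")
```

As the sample size grows, both the spread of the sample means and their skewness shrink, which is the numeric counterpart of the plots getting narrower and more bell-shaped.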


Now, the Distribution Graph for Sample Sizes of 3, 10, 30, 50, 100, 200 looks like,


[Figure: Distribution of Different Sample Sizes]

As we can observe, the distribution graphs for Sample Sizes 3 and 10 do not resemble a Normal Distribution. From Sample Size 30 onwards, however, the sampling distribution resembles a Normal Distribution more and more closely as the Sample Size increases.

As a rule of thumb, a sample size of 30 or above is usually large enough to treat the sampling distribution as nearly normal, so that further inferences can be drawn from it.


Through this Python Code, we can conclude that CLT’s following three properties hold.


  1. Sampling Distribution Mean(μₓ¯) = Population Mean(μ)

  2. Sampling distribution’s standard deviation (Standard error) = σ/√n

  3. For n > 30, the sampling distribution becomes approximately a normal distribution.

The mean commute time of the 30000 employees (μ) = 36.6 (sample mean) ± some margin of error. We can find this margin of error using the CLT (Central Limit Theorem). Now that we know what the CLT is, let’s see how we can find the margin of error.

Let’s say the mean commute time of a sample of 100 employees is X¯ = 36.6 min, and the Standard Deviation of the sample is S = 10 min. Using the CLT, we can infer that,

  1. Sampling Distribution Mean(μₓ¯) = Population Mean(μ)

  2. Sampling Distributions’ Standard Deviation = σ/√n ≈S/√n = 10/√100 = 1

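  3. Since the sampling distribution is approximately a Normal Distribution with a standard deviation of 1 min, about 95.4% of sample means fall within 2 standard errors (2 min) of μ:


    P(μ - 2 < 36.6 < μ + 2) = 95.4%, which we get from the 1-2-3 (68-95-99.7) rule of the Normal Distribution curve.

    P(μ - 2 < 36.6 < μ + 2) = P(36.6 - 2 < μ < 36.6 + 2) = 95.4%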

You can find the standard distribution curve, Z-Table, and its properties in my previous article, “Inferential Statistics.”

Now, we can say that there is a 95.4% probability that the Population Mean (μ) lies between 36.6 - 2 and 36.6 + 2. In other words, we are 95.4% confident that the error in estimating the mean is ≤ 2 min.

The probability associated with the claim is called the confidence level (here it is 95.4%). The maximum error made in the sample mean is called the margin of error (here it is 2 min). The final interval of values is called the confidence interval (here it is (34.6, 38.6)).
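The numbers in this example can be reproduced with the standard library alone. The sketch below takes the sample mean, sample standard deviation, and sample size from the commute example; the variable names are illustrative.

```python
from statistics import NormalDist
import math

sample_mean = 36.6   # minutes, from the survey of 100 employees
sample_std = 10.0    # minutes
n = 100

standard_error = sample_std / math.sqrt(n)   # S / sqrt(n)
margin_of_error = 2 * standard_error         # two standard errors

# Probability that a normal variable lies within 2 standard deviations of its mean.
z = NormalDist()
confidence_level = z.cdf(2) - z.cdf(-2)

interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print(f"standard error   = {standard_error:.1f} min")
print(f"margin of error  = {margin_of_error:.1f} min")
print(f"confidence level = {confidence_level:.4f}")
print(f"interval         = ({interval[0]:.1f}, {interval[1]:.1f})")
```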

We can generalize this concept in the following manner.


Let’s say that we have a sample with sample size n, mean X¯, and standard deviation S. Now, the y% confidence interval (i.e., the confidence interval corresponding to a y% confidence level) for μ would be given by the range:

Confidence interval = (X¯ - Z*·S/√n, X¯ + Z*·S/√n)

where Z* is the Z-score associated with a y% confidence level.
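The general formula can be wrapped in a small helper. This is a sketch under the normal approximation described above, not a library API: the function name `confidence_interval` and its parameters are hypothetical, and Z* is obtained from the standard library's `NormalDist.inv_cdf`.

```python
from statistics import NormalDist
import math

def confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    """Return (low, high) for the population mean using the normal approximation."""
    # Z* leaves (1 - confidence) / 2 in each tail of the standard normal curve.
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_star * sample_std / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Commute-time example: mean 36.6 min, S = 10 min, n = 100.
for level in (0.90, 0.95, 0.99):
    low, high = confidence_interval(36.6, 10.0, 100, confidence=level)
    print(f"{level:.0%} confidence interval: ({low:.2f}, {high:.2f})")
```

A higher confidence level needs a larger Z*, so the interval widens; a larger sample size n shrinks S/√n and narrows it.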

Some commonly used Z* values are given below:

  • 90% confidence level: Z* = 1.645

  • 95% confidence level: Z* = 1.96

  • 99% confidence level: Z* = 2.576

That is how we calculate the margin of error and estimate the mean of the whole population with the help of samples.


Conclusion

As we have seen, it is often enough to find the mean and standard deviation of only a small representative sample, which matters when time and money are limited. Using the CLT’s properties, we can estimate the Population Mean (μ), the Standard Error (σ/√n), and, most importantly, a confidence interval at a chosen confidence level (y%). The CLT is useful for polling results reported on the news with confidence intervals, and in insurance, banking, etc. That is all about the CLT, its properties, and how it can be useful in Data Science.

Thank you for reading and Happy Coding!!!


Check out my previous articles about Python here







