import pandas as pd #data manipulation and analysis
import numpy as np #lib used for working with arrays
import matplotlib.pyplot as plt #lib for plots and visualizations
import seaborn as sns #lib for visualizations
%matplotlib inline
import scipy.stats as stats #probability distribution
One Sample Z-test (when population standard deviation is known)
It is rare to know the population standard deviation without also knowing the population mean, but let's assume that is the case here.
It is known from experience that for a certain e-commerce company the mean delivery time of a product is 5 days, with a standard deviation of 1.3 days.
The new customer service manager of the company is afraid that the company is slipping and collects a random sample of 45 orders. The mean delivery time in this sample comes out to be 5.25 days.
Is there enough statistical evidence to support the manager's apprehension that the mean delivery time of products is greater than 5 days?
One-tailed test, concerning population mean μ, the mean delivery time of products
Use level of significance α=0.05
Let's write the null and alternative hypotheses
Let μ be the mean delivery time of the products
The manager will test the null hypothesis H0: μ = 5 against the alternative hypothesis Ha: μ > 5
Are the assumptions of Z-test satisfied?
- Samples are drawn from a normal distribution - since the sample size is 45 (which is > 30), the Central Limit Theorem implies that the distribution of sample means will be approximately normal. If the sample size were less than 30, we could apply the Z-test only if we knew that the population was normal.
- Observations are from a simple random sample - we are informed that the manager collected a simple random sample
- Standard deviation is known - yes.
Voila! We can use Z-test for this problem
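The Central Limit Theorem argument in the first assumption can be checked with a small simulation. This is only a sketch: the gamma-shaped population below is a hypothetical choice (the problem does not state the shape of the delivery-time distribution), picked so that its mean is 5 and its standard deviation is 1.3.

```python
import numpy as np

# Hypothetical skewed population: a gamma distribution chosen so that its
# mean is 5 days and its standard deviation is 1.3 days (an assumption made
# only for this illustration).
mu, sigma, n = 5.0, 1.3, 45
shape = (mu / sigma) ** 2      # gamma shape parameter
scale = sigma ** 2 / mu        # gamma scale parameter

rng = np.random.default_rng(0)
# Draw 10,000 samples of size n and take the mean of each one.
sample_means = rng.gamma(shape, scale, size=(10_000, n)).mean(axis=1)

mean_of_means = sample_means.mean()
sd_of_means = sample_means.std(ddof=1)
standard_error = sigma / np.sqrt(n)  # theoretical value, sigma / sqrt(n)

print(mean_of_means, sd_of_means, standard_error)
```

If the approximation holds, the simulated sample means should centre near 5 and their spread should be close to σ/√n, which is the denominator of the Z test statistic.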
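Next, we find the Z test statistic
The z-score, also called the standard score, is the difference between a value and the mean, divided by the standard deviation. In statistics, the standard score is the signed number of standard deviations by which the value of an observation or data point lies above the mean of what is being observed or measured.
Set the values of the population mean and population standard deviation to 5 and 1.3 respectively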
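Because this test standardizes a sample mean rather than a single observation, the denominator is the standard error σ/√n rather than σ itself. Written out with the numbers from the problem (the symbol μ₀ for the hypothesized mean is notation introduced here for illustration):

```latex
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
  = \frac{5.25 - 5}{1.3 / \sqrt{45}}
  \approx 1.29
```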
mu, sigma = 5, 1.3  # population mean and known population standard deviation (days)
# set the sample mean to 5.25
x_bar = 5.25
# calculate the test statistic:
# (sample mean - population mean) / (population standard deviation / sqrt(sample size))
test_stat = (x_bar - mu) / (sigma / np.sqrt(45))
test_stat
1.2900392177883402
Introduction to the Rejection Region and p-value Approaches
The test statistic alone does not tell us whether the evidence is significant enough to reject the null hypothesis in favour of the alternative hypothesis. To determine this, we use either of the following approaches:
1. Rejection region approach
2. p-value approach
1. Rejection Region Approach
For this approach, we follow the steps below.
1. We choose a level of significance (α).
α is the probability of rejecting the null hypothesis when it is true.
2. Then we find the rejection region in the graph.
3. We reject the null hypothesis if the test statistic falls in the rejection region. Otherwise, we do not reject the null hypothesis.
In the given example, the Z test statistic follows a standard normal distribution as shown in the above plot. A Z value lying in the right tail of the distribution gives strong evidence against the null hypothesis. To find the rejection region, we find the value of Z (called the critical value) that leaves an area of α in the right tail.
from scipy.stats import norm
critical_val=norm.ppf(1-.05) #the chance of being above it is 5%
critical_val
1.6448536269514722
x=np.linspace(-4,4,100)
plt.plot(x,norm.pdf(x,0,1))
plt.axvline(x=critical_val,c='r')
x1=np.linspace(critical_val,4,50)
plt.fill_between(x1,norm.pdf(x1,0,1),color='r')
plt.annotate('Reject Null',(2,0.20))
plt.annotate('Do Not Reject\n Null',(-1,0.2))
plt.show()
#as our test statistic (~1.29) does not lie in the rejection region, we cannot reject the null hypothesis. Thus we do not have statistical evidence to say that mean delivery time of a product is greater than 5 days.
#calculate the p-value
1-norm.cdf(test_stat)
0.09851852092578695
As the p-value (about 0.098) is greater than the level of significance 0.05, we cannot reject the null hypothesis. Thus we do not have statistical evidence to say that the mean delivery time of a product is greater than 5 days.
Key takeaway
- Both the rejection region approach and the p-value approach give the same result: the manager does not have enough statistical evidence to say that the mean delivery time of a product is greater than 5 days.
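The agreement between the two approaches can be made explicit with a small helper. This is a minimal sketch; the function name `right_tailed_z_test` and its return keys are hypothetical and not part of SciPy. It uses `norm.sf`, which gives the same upper-tail probability as `1 - norm.cdf`.

```python
import numpy as np
from scipy.stats import norm

def right_tailed_z_test(x_bar, mu_0, sigma, n, alpha=0.05):
    """Hypothetical helper: right-tailed one-sample Z-test with known sigma."""
    z = (x_bar - mu_0) / (sigma / np.sqrt(n))
    critical_val = norm.ppf(1 - alpha)  # rejection region is z > critical_val
    p_value = norm.sf(z)                # upper-tail area, same as 1 - norm.cdf(z)
    return {
        "z": z,
        "critical_val": critical_val,
        "p_value": p_value,
        "reject_by_region": bool(z > critical_val),
        "reject_by_p_value": bool(p_value < alpha),
    }

result = right_tailed_z_test(x_bar=5.25, mu_0=5, sigma=1.3, n=45)
print(result)
```

For a right-tailed test the two decisions always coincide, because z exceeds the critical value exactly when the area to the right of z is smaller than α.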
Exercise:
Level of significance: the probability of rejecting the null hypothesis when it is true. It is fixed before the hypothesis test is carried out.
The p-value is the probability of observing a test statistic at least as extreme as the one actually observed, assuming the null hypothesis is true.
The Z test statistic follows a standard normal distribution.
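The meaning of the level of significance can be illustrated by simulation. The sketch below assumes, purely for illustration, that delivery times are normally distributed and that the null hypothesis is true (mean 5, standard deviation 1.3). It then counts how often a right-tailed Z-test at α = 0.05 would reject.

```python
import numpy as np
from scipy.stats import norm

mu_0, sigma, n, alpha = 5.0, 1.3, 45, 0.05
rng = np.random.default_rng(42)

# Simulate 20,000 samples for which the null hypothesis is true (mean = mu_0).
# Normally distributed delivery times are an assumption made for this sketch.
samples = rng.normal(mu_0, sigma, size=(20_000, n))
z_scores = (samples.mean(axis=1) - mu_0) / (sigma / np.sqrt(n))

# Right-tailed p-value for each simulated sample.
p_values = norm.sf(z_scores)

# Fraction of true-null samples that would be (wrongly) rejected.
type_one_error_rate = np.mean(p_values < alpha)
print(type_one_error_rate)
```

The printed fraction should be close to 0.05, which is what "probability of rejecting the null hypothesis when it is true" means in practice.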