What Is the Z Test in Inferential Statistics & How Does It Work?

I Feel:

The more you analyze data, the more enlightened a data engineer you become.

In data engineering, you will often need to establish whether a data sample drawn from a population is reliable enough to build a model around. For example, the sample may come from an old archive that no longer represents the true behavior of the process in production, because behavior changes over time, and so does the process on which the model was built.

If we go ahead and build a new model around such old sample data, we may end up with a faulty process and a model that is neither effective nor useful. So we first perform an inferential statistical test to check that the data is reliable.

One such test is the normal deviate Z test. It checks whether our sample plausibly came from a population that truly represents process behavior in production, before we go ahead and build a model around it.

Earlier, in part 1 of Inferential Statistics, we learned about the chi-square test.

I invite you all to read it. As promised, today we cover more of the statistical tests used in inferential hypothesis testing to establish sample data reliability. Let's get started with one such test, the normal deviate Z test, which we cover in detail below.

What Is the Normal Deviate Z Test & How Does It Work?

When we use the normal deviate Z test to establish the reliability of a large sample (a sample size greater than 30 is the usual rule of thumb), we compare two means: the mean of the given sample in our data science project and the mean of the production (population) data.

The Z test compares the sample mean and the population mean to determine whether the difference is significant. The Z test statistic is assumed to follow a normal distribution, and nuisance parameters such as the population standard deviation must be known in order to perform an accurate Z test.

How Does the Normal Deviate Z Test Work?

We will see how the Z test works in the following steps.

Step 1: Establishing the Hypotheses:

This is the first thing a data engineer needs to state before performing any statistical test in inferential statistics.

H0: The difference between the sample mean and the population mean is a statistical fluctuation.

H1: The difference between the sample mean and the population mean is significant. It is too large to be the result of statistical fluctuation.

Step 2: Calculating the Z Test Statistic

Prerequisites: to perform a Z test on normally distributed data, the following must hold:

  • Number of samples >= 30,
  • The mean and standard deviation of the population should be known

Z-Test Statistic Formula:

The Z statistic is calculated as:

Z = (M - μ) / SE

Where M is the sample mean to be standardized, μ (mu) is the population mean, and SE is the standard error of the mean.

SE is calculated using the formula below:

SE = s / SQRT(n)

Where s is the population standard deviation and n is the sample size (the number of observations in the sample).

The standard error is the standard deviation of the sampling distribution of the mean, which the central limit theorem tells us is approximately normal for large samples.

The formula above looks very similar to the Z score calculation because both standardize a value against a normal distribution. The Z score standardizes a single observation using the standard deviation, while the normal deviate Z standardizes a sample mean using the standard error.
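To make the two formulas concrete, here is a minimal sketch in plain Python. The helper name `z_statistic` and the numbers passed to it are hypothetical, chosen only so the arithmetic is easy to follow by hand; they are not taken from the data set used later.

```python
import math


def z_statistic(sample_mean, pop_mean, pop_std, n):
    """Return (standard_error, z) for a one-sample Z test."""
    if n <= 0 or pop_std <= 0:
        raise ValueError("n and pop_std must be positive")
    standard_error = pop_std / math.sqrt(n)         # SE = s / SQRT(n)
    z = (sample_mean - pop_mean) / standard_error   # Z = (M - mu) / SE
    return standard_error, z


# Hypothetical inputs: SE = 10 / sqrt(100) = 1, so Z = (52 - 50) / 1 = 2
se, z = z_statistic(sample_mean=52.0, pop_mean=50.0, pop_std=10.0, n=100)
print(se, z)
```

Note that n here is the number of observations, so a larger sample shrinks SE and makes the same difference in means produce a larger Z.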

Step 3: Analyze the Z Value to Interpret the P-Value

Once we have the Z value, we calculate the p-value, based on which we either reject or fail to reject the null hypothesis.
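As a rough illustration of that decision rule, the sketch below uses only Python's standard library (`statistics.NormalDist`) and assumes a two-tailed test at a significance level of 0.05. The helper `two_sided_p_value` and the Z values in the loop are made up for illustration. It also shows that comparing the p-value with 0.05 is the same as comparing |Z| with the critical value of about 1.96.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Two-sided p-value for a Z statistic under the standard normal."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


alpha = 0.05  # assumed significance level
critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96

for z in (0.5, 1.5, 2.5):  # illustrative Z values, not from the data set
    p = two_sided_p_value(z)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"z={z}: p={p:.4f}, |z| > critical: {abs(z) > critical}, {decision}")
```

For very large |Z|, `1 - cdf` loses precision and rounds to zero, which is why a survival function such as `scipy.stats.norm.sf` is the better choice for extreme values.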

Example Using Python & Jupyter Notebook:

Let's work through the steps above with a practical example.

Install the Anaconda Distribution:

Follow the Anaconda link and download the latest version for Python based on your OS. It comes with Jupyter Notebook pre-installed, along with the required Python packages such as pandas and SciPy.

Once the installation is done, launch Jupyter Notebook and enter the following code (you can copy it) to get started.

Import the Required Packages:

Let's import some relevant Python packages as shown below and create a data frame by reading "pima-indians-diabetes.csv", sourced from Kaggle.

import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
import scipy.stats as st
import seaborn as sns

# Read the CSV file into df as a pandas DataFrame
df = pd.read_csv("pima-indians-diabetes.csv")

Let's view the first 20 rows of the sample data set by calling df.head(20).

df.head(20)
[Image: first 20 rows of the data frame]

Step 1: Let's Formulate Our Null and Alternate Hypotheses:

Null Hypothesis:

H0: The difference between the mean of the sample BP column (the Pres column in the data frame above) and the population mean for BP is a statistical fluctuation.

Alternate Hypothesis:

H1: The difference between the mean of the sample BP column and the population mean is significant, and is not a case of mere statistical fluctuation.

Step 2: Let's Calculate the Z Statistic (Z Test):

As already discussed, the Z statistic formula is:

Z = (M - μ) / SE

Where M is the sample mean to be standardized, μ (mu) is the population mean, and SE is the standard error of the mean.

So let's do this calculation in Jupyter Notebook:

Here is the code snippet to calculate the Z test:

# Prerequisites: number of samples >= 30; the mean and standard deviation of the population should be known.
# Here the average diastolic blood pressure is 71.3 with a standard deviation of 7.2.
# source - http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_BiostatisticsBasics/BS704_BiostatisticsBasics3.html

# Let's apply the normal deviate Z test to the blood pressure (Pres) column of the given data frame
mu = 71.3  # population mean, μ
std = 7.2  # population standard deviation

# M: the mean of the BP column (Pres) in the given data frame
MeanOfBpSample = np.average(df['Pres'])
print("Mean Of BP Column", MeanOfBpSample)

# n is the number of rows (observations); df.size would count every cell (rows * columns)
n = len(df)
SE = std / np.sqrt(n)
print("Standard Error:", SE)

# Z_norm_deviate = (sample_mean - population_mean) / standard_error
Z_norm_deviate = (MeanOfBpSample - mu) / SE
print("Normal Deviate Z Value:", Z_norm_deviate)

If you run the code above in your notebook on the 768-row data set, it reports a sample mean of 69.10546875, a standard error of about 0.26 (7.2 / SQRT(768)), and a normal deviate Z value of about -8.45.

Now that we know the Z value, let's find our p-value.

Calculating the P-Value, Code Snippet:

# We use the survival function sf of the normal distribution from scipy.stats.
# Multiplying sf by 2 gives the two-sided p-value for a two-tailed test.
p_value = scipy.stats.norm.sf(abs(Z_norm_deviate)) * 2
print('p value:', p_value)

if p_value > 0.05:
    print('Samples are likely drawn from the same distributions (fail to reject H0)')
else:
    print('Samples are likely drawn from different distributions (reject H0)')

If you run the above code snippet in Jupyter, it prints the p-value followed by the "reject H0" message.

Step 3: Analyze the Z Value to Interpret the P-Value

The p-value comes out at roughly 3e-17. Because the p-value is far below the conventional significance level of 0.05, we conclude that the given sample did not come from the population distribution on which the process was built. There is a significant difference between the mean of the sample BP column and the population mean, so we reject the null hypothesis H0 in favor of the alternative hypothesis H1.
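One detail worth double-checking in the standard error is the sample size n. In pandas, `len(df)` is the number of rows, while `df.size` is rows times columns, so using `df.size` inflates n and shrinks the standard error. The sketch below shows the difference on a small hypothetical data frame; `df_demo` and its values are made up for illustration and are not the Pima data.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row, 3-column frame standing in for the real data set
df_demo = pd.DataFrame({
    "Pres": [70, 72, 68, 74],
    "Age": [30, 41, 25, 52],
    "Class": [0, 1, 0, 1],
})

n_rows = len(df_demo)    # number of observations: 4
n_cells = df_demo.size   # rows * columns: 12

std = 7.2  # population standard deviation used in the article
se_rows = std / np.sqrt(n_rows)    # standard error with the correct n
se_cells = std / np.sqrt(n_cells)  # too small: n is inflated by the column count

print(n_rows, n_cells)
print(se_rows, se_cells)
```

Because Z is the difference in means divided by SE, an SE that is too small exaggerates the Z value and makes the p-value look far more extreme than it should.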

Because the normal deviate Z test rejects the null hypothesis here, we recommend against building an ML model on this sample data.

Aspiring and working data engineers need a clear understanding of the p-value, since it is the basis of most statistical data reliability tests. Let me quickly cover the basics here; we will look at it more deeply in a separate article devoted to the p-value.

What Is the P-Value?

The p-value, or probability value, is the probability of obtaining a value at least as extreme (as small or as large) as the one observed in the sample, given that the null hypothesis is true.

How to Calculate the P-Value in General?

  1. Frame your hypothesis
  2. Assume the null hypothesis to be true
  3. Calculate the z or t value for the observed sample
  4. From the z/t table, find the probability associated with the z or t value obtained above. You can also find the p-value with SciPy's built-in methods by passing in the z or t statistic calculated in step 3.
  5. This probability is the p-value you need
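As a sketch of step 4 with SciPy instead of a printed table, the survival functions `scipy.stats.norm.sf` and `scipy.stats.t.sf` return the upper-tail probability for a given statistic. The wrapper names `p_value_from_z` and `p_value_from_t` and the sample statistics below are hypothetical, used only for illustration.

```python
from scipy import stats


def p_value_from_z(z, two_sided=True):
    """Tail probability of a Z statistic under the standard normal."""
    tail = stats.norm.sf(abs(z))  # survival function: P(Z > |z|)
    return 2 * tail if two_sided else tail


def p_value_from_t(t, dof, two_sided=True):
    """Tail probability of a t statistic with `dof` degrees of freedom."""
    tail = stats.t.sf(abs(t), dof)
    return 2 * tail if two_sided else tail


# Illustrative statistics, not taken from the article's data
print(p_value_from_z(1.96))         # close to 0.05
print(p_value_from_t(2.0, dof=29))  # t tails are heavier, so this exceeds p_value_from_z(2.0)
```

With `two_sided=False` the helpers return the one-tailed probability in the direction of the observed statistic, which is the right choice only when the alternative hypothesis is directional.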

We will cover p-value calculation, interpretation, and use cases separately later on. You will also see it in practice as we cover all the hypothesis test types on our journey through inferential statistics.

What's Next?

In our next article, "Inferential Statistics: Hypothesis Testing using T-Test", we will cover the t-test in detail.

Before we close, let's cover some basics of the t-test.

What Is a T-Test?

A t-test is an inferential statistical test used to find out whether there is a significant difference between the means of two given groups, which may be related through certain features.

A t-test uses the t-statistic, the t-distribution, and the degrees of freedom to determine how likely it is that the observed difference between two sets of data arose by chance.

Types of T-Test:

There are three types of t-test:

One-sample t-test:

Used to compare a sample mean with a known population mean or some other meaningful, fixed value

Independent samples t-test:

Used to compare two means from independent groups

Paired samples t-test:

  1. Used to compare two means that are repeated measures for the same participants; scores might be repeated across different measures or across time.
  2. Also used to compare paired samples, as in a two-treatment randomized block design.
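As a preview, the three variants map onto three SciPy functions: `scipy.stats.ttest_1samp`, `scipy.stats.ttest_ind`, and `scipy.stats.ttest_rel`. The sketch below runs each on synthetic data; the arrays, the seed, and the reference value 71.3 are assumptions made purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data for illustration only
before = rng.normal(loc=70.0, scale=8.0, size=40)
after = before + rng.normal(loc=-2.0, scale=3.0, size=40)  # same subjects, measured again
other_group = rng.normal(loc=73.0, scale=8.0, size=35)     # unrelated subjects

# One-sample: compare a sample mean with a fixed reference value
one_sample = stats.ttest_1samp(before, popmean=71.3)

# Independent samples: compare the means of two unrelated groups
independent = stats.ttest_ind(before, other_group)

# Paired samples: compare two measurements on the same subjects
paired = stats.ttest_rel(before, after)

for name, result in [("one-sample", one_sample),
                     ("independent", independent),
                     ("paired", paired)]:
    print(f"{name}: t={result.statistic:.3f}, p={result.pvalue:.4f}")
```

A paired test is equivalent to a one-sample test on the per-subject differences, which is why it needs both arrays to have the same length and order.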

We will cover how to perform each of these t-tests using examples and hands-on lab exercises.

Refer to the graphic below, which shows a decision tree to help you choose the right kind of hypothesis test for a given problem statement.

How To Decide What Test To Use & When?

[Image: decision tree for choosing a hypothesis test]

Summary:

Never rely on plain observation or assumption when you build a model on a given sample. Make sure you measure its distribution type and test the data sample using statistical hypothesis testing to ensure your sample data is reliable. Descriptive and inferential statistical techniques are designed to help you make better decisions about data sampling before you model the data in machine learning.

Data cleansing and EDA will fill a large part of your work life as a data scientist, so it's imperative that you take responsibility for handling data with the utmost clarity and care and test it for reliability. Your model is going to drive some really critical business decisions, so your work will influence market dynamics in a big way.

I Feel:

Getting data interpretation wrong while building ML models can be costly. So don't build models just for the sake of building; make sure each one has been fed the right kind of food in terms of data. The right data feeding habit will do wonders when your machine makes intelligent, precise, ML-based predictions and recommendations for your business. Everybody in the ecosystem benefits when the model building process is done right.

Translated from: https://medium.com/swlh/what-is-z-test-in-inferential-statistics-how-it-works-3dde6eae64e5
