ML: Sampling Distribution and Z-test

weixin_26735933

Tags: machine learning, computer vision

Original article: https://medium.com/swlh/ml-sampling-distribution-z-test-550ee1838762

In the previous post, we looked at how to set up hypothesis tests and experiments. In this post, we start looking at specific methods. The first method we are going to study is the Z-test.

Sampling Distribution

Before the Z-test, we need to know what a sampling distribution is and how to build one. A sampling distribution is the probability distribution of a statistic obtained from a large number of samples drawn from a specific population. There are three ways to build it:

Simulation: repeat an experiment many times. => Tossing a coin.
Analytically: we know the underlying distribution well enough to derive the sampling distribution. => We measure the weight or symmetry of the coin and calculate the sampling distribution with statistical techniques.
Central Limit Theorem (CLT): applies when the mean is the parameter we want to estimate. Regardless of the underlying distribution the data is drawn from (as long as its variance is finite), the sampling distribution of the mean is approximately normal for a large enough sample. Its mean is the population mean, and its standard deviation is σ/√n, where σ is the population standard deviation and n is the sample size. Thus, the larger the sample size, the narrower the sampling distribution.
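The simulation route and the CLT prediction can be compared directly. The following is a minimal sketch, assuming a fair coin and using only the Python standard library; the function name `simulate_sample_means` and the choice of 5000 repeated experiments are illustrative, not from the original post.

```python
import math
import random
import statistics


def simulate_sample_means(n_tosses, n_experiments, p_head=0.5, seed=0):
    """Repeat a coin-toss experiment and return the proportion of heads in each run."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_experiments):
        heads = sum(1 for _ in range(n_tosses) if rng.random() < p_head)
        means.append(heads / n_tosses)
    return means


# Sampling distribution of the mean, built by simulation.
means = simulate_sample_means(n_tosses=100, n_experiments=5000)

# CLT prediction: standard deviation of the sample mean is sigma / sqrt(n).
sigma = math.sqrt(0.5 * (1 - 0.5))  # std of a single fair toss (heads=1, tails=0)
predicted_se = sigma / math.sqrt(100)

print("simulated mean of sample means:", statistics.mean(means))
print("simulated std of sample means: ", statistics.stdev(means))
print("CLT-predicted standard error:   ", predicted_se)
```

The simulated standard deviation should land close to the predicted σ/√n; rerunning with a larger `n_tosses` shows the distribution narrowing.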

Tips: The standard deviation of the sampling distribution from the CLT is also called the standard error. As a rule of thumb, an n larger than 30 is treated as large enough in practice. However, if the underlying distribution has very high variance, we need much larger samples. In that case, binning the data can help, because it removes small variations in the data.

What we are going to do:

Let's say you toss a coin and heads appears 40 times in 100 trials. Can you say the coin is fair? Now let's say heads appears only 10 times in 100 trials. Can you still say the coin is fair? Most people would blame the dealer. But where does that feeling come from, and how do you judge it? We want to make this inference with formal language and mathematics. That is almost all there is to it. The null hypothesis is that the coin is fair, and the alternative hypothesis is that the coin is not fair.

Formal way:

Build the distribution assuming the null hypothesis is true, that is, the coin is fair. Calculate the probability of every possible number of heads in 100 tosses; the probabilities for 0 through 100 heads are easy to calculate.
Run the experiment: toss the coin 100 times and record the number of heads.
Compare the experimental result with the distribution and look at the probability of a result at least that extreme (the p-value). If that probability is below the threshold you chose in advance, reject the null hypothesis.
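The three steps can be sketched for the coin example with the exact binomial distribution. This is a minimal illustration, assuming a two-sided alternative ("the coin is not fair"); the helper names `binomial_pmf` and `two_sided_p_value` are hypothetical and not from the original post.

```python
from math import comb


def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n tosses when P(head) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(observed_heads, n, p=0.5):
    """Probability, under the null, of a count at least as far from n*p as observed."""
    expected = n * p
    distance = abs(observed_heads - expected)
    return sum(
        binomial_pmf(k, n, p)
        for k in range(n + 1)
        if abs(k - expected) >= distance
    )


# Step 1: the null distribution covers every outcome from 0 to 100 heads.
total_probability = sum(binomial_pmf(k, 100) for k in range(101))

# Steps 2 and 3: compare each observed count with that distribution.
p_40_heads = two_sided_p_value(40, 100)
p_10_heads = two_sided_p_value(10, 100)

print("total probability:", total_probability)
print("p-value for 40 heads:", p_40_heads)
print("p-value for 10 heads:", p_10_heads)
```

Under these assumptions, 40 heads gives a p-value slightly above 0.05, a borderline case, while 10 heads gives a p-value so small that a fair coin is hard to defend.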

Note: If you set up the alternative hypothesis incorrectly, every result can be useless.

Z-test

The Z-test specifically uses the sampling distribution of the mean from the CLT. Let's look at an example. Let's assume I am a professor (what a beautiful future). I know how long my students take to solve a certain problem. There is a new problem they have never seen before, and they take 2.8 times longer on it than on the previous one. I want to know whether the new problem is harder than the previous one.

H0: The new problem is as hard as the previous problem.

H1: The new problem is harder than the previous problem.

Let's say I know the mean and standard deviation of the time students spent on the previous problem: 1.0 and 0.948.

Since I will be a notorious professor, I give the new problem to 25 students and record their times. The result: mean = 2.8 (N = 25).

We are looking at the mean, so the sampling distribution follows the CLT case: it is a normal distribution with mean 1.0 and standard deviation 0.948/√25 = 0.948/5 ≈ 0.19. What is the probability of our result under this sampling distribution? We could calculate it directly from this distribution, but there is a slightly more convenient form: calculating the Z-score, which is why this test is called the Z-test.

We convert the sampling distribution into the standard normal distribution, which has mean 0 and standard deviation 1. In our case, the Z-score is about 9.47 ((2.8 - 1.0)/0.19).
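The Z-score calculation for the professor example can be sketched as follows. This is a minimal illustration, assuming a one-tailed (upper-tail) test; the function name `z_test_upper_tail` is hypothetical. Because it uses the unrounded standard error 0.948/5 = 0.1896, the Z-score comes out near 9.49 instead of the 9.47 obtained with the rounded 0.19.

```python
import math


def z_test_upper_tail(sample_mean, pop_mean, pop_std, n):
    """Return (z, p_value) for a one-tailed Z-test of 'sample mean is larger'."""
    standard_error = pop_std / math.sqrt(n)  # sigma / sqrt(n) from the CLT
    z = (sample_mean - pop_mean) / standard_error
    # Upper-tail area of the standard normal. erfc keeps precision for
    # very large z, where 1 - cdf(z) would round to 0.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


z, p_value = z_test_upper_tail(sample_mean=2.8, pop_mean=1.0, pop_std=0.948, n=25)
print("Z-score:", z)
print("one-tailed p-value:", p_value)
```

The resulting p-value is many orders of magnitude below any common threshold, which is why it does not appear in a printed Z-table.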

You can look up the probability in a Z-score table. Ours is too low to appear in a typical table. The table has some famous values: 1.65, 1.96, and 2.575. They correspond to 10%, 5%, and 1% when both tails are counted; in a single tail they correspond to 5%, 2.5%, and 0.5%. Our test uses only one tail, because H1 says the new problem is harder.
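The tail areas behind these well-known Z values can be checked without a printed table. This is a small sketch using `statistics.NormalDist` from the Python standard library (Python 3.8+); the helper name `tail_areas` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def tail_areas(z):
    """Return (one-tail area, two-tail area) beyond a positive Z value."""
    one_tail = 1 - std_normal.cdf(z)
    return one_tail, 2 * one_tail


for z_value in (1.65, 1.96, 2.575):
    one_tail, two_tail = tail_areas(z_value)
    print(f"z = {z_value}: one tail = {one_tail:.4f}, two tails = {two_tail:.4f}")
```

In this sketch, the three values give roughly 5%, 2.5%, and 0.5% in a single tail, and roughly 10%, 5%, and 1% when both tails are counted, so it matters whether a threshold is quoted for a one-tailed or a two-tailed test.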

We finished our experiment. Do you think you can reject the null hypothesis? I think the evidence is enough. In many cases, data scientists use a Z-score of 1.96 (two-tailed p-value 0.05) or 2.575 (two-tailed p-value 0.01) as the threshold. Our Z-score is far beyond both, so we reject the null hypothesis, and my students can tell me to change the exam.

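Notes: The p-value is the probability of seeing a result at least as extreme as ours if the null hypothesis is true. It is not the probability that the null hypothesis is true. The significance level we compare it against (for example 0.05) is the probability of wrongly rejecting a true null hypothesis. The test rests on many assumptions, so we require a small p-value.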

This post was published on 9/9/2020.