ML Hypothesis Testing


weixin_26728833


Original article: https://medium.com/swlh/ml-hypothesis-testing-ccbe52cf3108


As data scientists, we need to know how to formulate a hypothesis properly and how to test it with the tools we have learned. This post walks through the steps for building a sound hypothesis.

Minimum Description Length (MDL)

The idea is simple. A very precise model has small errors, but it pays for that precision with complexity. A very simple model has large errors. Complexity and precision always trade off against each other, because describing a more precise model takes more bits, and more bits mean a more complex model. The goal is a model whose errors are small without the model itself becoming large and complex. This is closely related to Occam's razor.
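
As a rough illustration of this trade-off, the sketch below scores two models by an approximate two-part description length: bits spent on the errors plus bits spent on the parameters. The scoring formula is a common simplification rather than anything taken from the original post, and the data, the function names, and the two candidate models are all hypothetical. Only differences between scores are meaningful, so a negative value is not a problem.

```python
import math

def fit_constant(xs, ys):
    """Simplest model: predict the mean of y everywhere (1 parameter)."""
    mean_y = sum(ys) / len(ys)
    return [mean_y for _ in xs], 1

def fit_line(xs, ys):
    """Least-squares straight line (2 parameters)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * x for x in xs], 2

def description_length(ys, preds, k):
    """Approximate code length in bits: error cost plus parameter cost."""
    n = len(ys)
    rss = sum((y - p) ** 2 for y, p in zip(ys, preds))
    error_bits = 0.5 * n * math.log2(rss / n)   # smaller errors need fewer bits
    model_bits = 0.5 * k * math.log2(n)         # each parameter costs bits
    return error_bits + model_bits

def total_bits(fit, xs, ys):
    preds, k = fit(xs, ys)
    return description_length(ys, preds, k)

# Made-up data: one series with a clear trend, one that is flat plus noise.
xs = list(range(10))
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1]
ys_trend = [2 * x + e for x, e in zip(xs, noise)]
ys_flat = [5 + e for e in noise]

for name, ys in (("trend", ys_trend), ("flat", ys_flat)):
    print(name,
          "constant:", round(total_bits(fit_constant, xs, ys), 2),
          "line:", round(total_bits(fit_line, xs, ys), 2))
```

On the trending data the extra parameter pays for itself because the errors shrink a lot. On the flat data the line barely reduces the errors, so the simpler constant model gets the shorter description.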

Building a Hypothesis and Confidence Interval

Consider an example. We measure the heights of students at two high schools, taking a sample of 50 students from each, and find a mean of 175 cm for school A and 177 cm for school B. Can we say that the students in school B are taller than the students in school A? No. The honest answer is that we don't know yet. How does a data scientist answer this kind of question properly? The following sections explain it step by step.

The very first step is to state a null hypothesis and an alternative hypothesis. The null hypothesis is what we already know or the prevailing theory; the alternative hypothesis is the new theory or the information we are trying to establish. The whole procedure assumes the null hypothesis is true. If we then observe a result that would be very unlikely under the null hypothesis and that points toward the alternative, we reject the null hypothesis in favor of the alternative. How extreme does the result have to be? Statisticians quantify this with a p-value or, equivalently, by checking whether the statistic falls outside a confidence interval. If the statistic lies inside the confidence interval, the result could plausibly occur under the null hypothesis, so we keep (fail to reject) the null hypothesis. If it lies outside, we reject the null hypothesis.

So we need to fix the hypothesis and the confidence level before running the experiment. This matters because otherwise it is tempting to settle on a conclusion first and then choose whatever threshold fits the data. Please don't do that. People often ask for a 99% confidence interval.
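
To see why the two sample means alone cannot settle the school question, here is a minimal sketch of a one-sided two-sample z-test on the numbers from the example (175 cm vs 177 cm, 50 students each). The post does not give the spread of the heights, so the standard deviations used here (6 cm and 3 cm) are pure assumptions, as is the choice of a z-test; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_pvalue(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """One-sided p-value for H1: mean_b > mean_a, assuming known standard deviations."""
    standard_error = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = (mean_b - mean_a) / standard_error
    # Probability of a difference at least this large if H0 (no difference) is true.
    return 1.0 - NormalDist().cdf(z)

ALPHA = 0.01  # matches a 99% confidence level, chosen before looking at the data

for assumed_sd in (6.0, 3.0):  # hypothetical spreads, not from the article
    p = two_sample_z_pvalue(175.0, 177.0, assumed_sd, assumed_sd, 50, 50)
    decision = "reject H0" if p < ALPHA else "keep H0"
    print(f"assumed sd = {assumed_sd} cm: p = {p:.4f} -> {decision}")
```

The same 2 cm gap leads to opposite decisions depending on the assumed spread, which is why the answer to the original question is "we don't know" until the variability and the threshold are both on the table.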

Experiments for Collecting Data

Figure: Observation experiments are the dominant kind in data science.

Now we have a hypothesis and a confidence interval, so we should design an experiment to collect the data. There are two types of experiments: manipulation experiments and observation experiments. Manipulation experiments are the typical science experiments in which we control the conditions and compare groups. Observation experiments look for associations in data that is given to us, without controlling the data or the subjects. Most big data is observational, because it is hard to control such a large amount of data.

We need to decide which feature or metric is the dependent variable, the one we are interested in. In the earlier example it is the height of the students. There can be more than one dependent variable. We also need to define the independent variable, which can be anything related to our question, such as the nutrition or genetic information of the students in our example. There is a third type of variable, the extraneous variable. It affects the dependent variable, but we are not interested in it, so we need to control it. In our case it could be the time of day, because people are slightly taller in the morning than in the evening, so we should measure at a fixed time. You should also watch out for ceiling effects, order effects, and sampling bias.

Caveat: in most cases we skip the experiment step, because the data simply comes from a client or an existing repository.

Exploratory Data Analysis

We have finished the experiment and collected the data. Before building a model, we need to analyze the data itself to understand it in more detail.

Clustering shows how the data clumps together.
Binning and histograms show how the data is distributed.
Simple regression fits help check for linearity.
Correlation analysis identifies redundant features to drop or transform.
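
Two of these checks, binning and correlation analysis, are easy to sketch with the standard library alone. The height values below are invented for illustration, and the helper names are hypothetical; in practice a library such as pandas or NumPy would usually do this work.

```python
from collections import Counter
from math import sqrt

def bin_counts(values, bin_width):
    """Count values per bin; each bin is labelled by its lower edge."""
    return Counter(int(v // bin_width) * bin_width for v in values)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up sample of student heights.
heights_cm = [168, 171, 173, 174, 175, 176, 177, 179, 182, 185]
# The same measurement in inches carries no new information.
heights_in = [h / 2.54 for h in heights_cm]

# A text histogram with 5 cm bins.
for lower_edge, count in sorted(bin_counts(heights_cm, 5).items()):
    print(f"{lower_edge}-{lower_edge + 5} cm: {'#' * count}")

# A correlation near 1 flags one of the two features as redundant.
print("correlation:", round(pearson(heights_cm, heights_in), 3))
```

A feature that is almost perfectly correlated with another, like the inch column here, is a candidate for dropping before modelling.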

Build a Model and Check the Metric or Dependent Features

This part depends heavily on your experiment and your EDA. Which model you choose is up to you, but you have to understand the algorithm inside it and its limitations.

Hypothesis Testing and Parameter Estimation

The goal of hypothesis testing is to infer how the algorithm performs on the population from a test on sample data. The truth lives in the population, but all we have is the sample from the experiment. A test on a sample gives you a statistic, while the population has a parameter. We need to infer the parameter from the statistic, with some prediction error. The sample average versus the population mean is the best-known example of a statistic and a parameter. Two factors influence this prediction error:

Sample size, which is under my control.
The variance of the underlying distribution, which is out of my control.

We can compare the metrics or features of the algorithms with a statistical test (there are many testing methods), using the confidence interval we defined above. => Hypothesis testing
We can estimate the parameter with an interval around the statistic that represents our confidence. The true parameter is expected to lie within that interval at the chosen confidence level. => Parameter estimation
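
Parameter estimation can be sketched as a confidence interval around a sample mean. The version below uses a normal approximation with an assumed standard deviation of 6 cm, which is not given in the post; the function name is hypothetical. It also shows the role of the one factor under our control, the sample size.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sd, n, confidence=0.99):
    """Normal-approximation confidence interval for a population mean."""
    # Two-sided critical value, e.g. about 2.58 for 99% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin

ASSUMED_SD = 6.0  # hypothetical spread of heights in cm

for n in (50, 200):
    low, high = mean_confidence_interval(175.0, ASSUMED_SD, n)
    print(f"N = {n}: 99% interval = ({low:.2f}, {high:.2f}), width = {high - low:.2f}")
```

Quadrupling the sample size halves the width of the interval, while the variance of the underlying distribution stays whatever it is.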

Example

Formulate the null hypothesis H0: A = B and the alternative H1: A < B, where A is the mean height of the students in school A and B is the mean height of the students in school B. To keep the explanation simple, we assume the true value of B is known: B = 176.
Draw a sample of size N from school A: measure the heights of those students and compute the sample mean of A.
Assume the null hypothesis (H0) is true and estimate the distribution of the sample mean.
Calculate the probability of obtaining the observed sample mean given H0.

P(the mean of the sample | H0)
In practice this is the probability, under H0, of a sample mean at least as extreme as the one observed. If P(the mean of the sample | H0) is too low, reject H0 in favor of H1.
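
The steps above can be sketched as a one-sided, one-sample z-test. B = 176 comes from the example; the sample mean of 175 and N = 50 reuse the numbers from the earlier school example, and the standard deviation of 6 cm is an assumption made only so the calculation can run. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def p_value_mean_below(sample_mean, h0_mean, sd, n):
    """P(sample mean <= observed | H0), for H1: true mean < h0_mean."""
    # Under H0 the sample mean is approximately Normal(h0_mean, sd / sqrt(n)).
    sampling_distribution = NormalDist(mu=h0_mean, sigma=sd / sqrt(n))
    return sampling_distribution.cdf(sample_mean)

B = 176.0          # assumed-known mean height for school B (from the example)
ALPHA = 0.01       # threshold fixed in advance, matching 99% confidence
ASSUMED_SD = 6.0   # hypothetical standard deviation in cm

p = p_value_mean_below(sample_mean=175.0, h0_mean=B, sd=ASSUMED_SD, n=50)
print(f"P(mean of the sample | H0) = {p:.4f}")
print("reject H0 in favor of H1" if p < ALPHA else "keep H0")
```

Under these assumptions the probability is well above the 1% threshold, so a sample mean of 175 cm is not surprising enough to reject H0.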

The next post will discuss specific testing methods.

This post was published on 9/8/2020.

