Model Assumptions Explained: Interpreting the Assumptions of a Regression Model

weixin_26752765

Original article: https://towardsdatascience.com/model-assumptions-explained-2c7bb7607f1c

Ever heard of model assumptions? What are they? And why are they important? A model is a simplified version of reality, and machine learning models are no different. To create models, we need to make assumptions, and if these assumptions are not verified and met, we may get into some trouble.

Every (machine learning) model has a different set of assumptions. We make assumptions on the data, on the relationship between different variables, and on the model we create with this data. Most of these assumptions can actually be verified. So one thing you’ll always want to do is ask whether the assumptions have been verified. Some assumptions are only relevant for making conclusions about relationships (e.g. a 1-degree increase in temperature shows a 4% increase in ice-cream sales), and others are also relevant to predict outcomes (we predict ice cream sales of x tomorrow).

Let’s go through the assumptions that are made for one of the simplest models out there: linear regression.

Assumption 1: fixed regressors

What this actually means is that we assume that the variables (input data) are not random variables but fixed numbers and that if we rerun the experiment (we collect the data again in the same manner), we expect the same results.

The opposite of a fixed regressor is a random (or stochastic) regressor, which is typically viewed as data sampled from a wider population. If this is the case, you can only draw conclusions ‘conditional’ on the data: the conclusions hold for this data, but you cannot generalize beyond your dataset.

The verdict: if your data is (representative of) the population, you are good. Otherwise, try to collect representative data, or only draw conclusions about the data the model was built on.

For business readers: if you have data on all your customers and want to predict the behavior of new customers, you are fine as long as you are targeting a similar type of customer. If not, you may end up with totally wrong recommendations or conclusions about these new customers, and lose them before you have won them. So ask about the representativeness of the dataset.

Ask about the representativeness of the dataset. If the data is representative of the population, you are good.

Assumption 2: random disturbances, zero mean

We assume that the errors around our model are random and that, on average, they level out to zero over all observations. This is something you can actually check.

The verdict: take the average of all your error terms and check whether it is statistically significantly different from zero. If yes → you may want to adjust your model and include more terms.
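The check in the verdict can be sketched in a few lines of Python. This is a minimal illustration, not a full testing procedure: the error values are made up, the helper name `mean_error_check` is hypothetical, and errors are taken as prediction minus actual value, so a positive mean means overestimating. Keep in mind that for ordinary least squares with an intercept the in-sample residuals average to zero by construction, so the check is most informative on held-out data.

```python
import math
import statistics


def mean_error_check(errors):
    """Return the mean error and its one-sample t-statistic against zero."""
    n = len(errors)
    mean = statistics.fmean(errors)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(errors) / math.sqrt(n)
    return mean, mean / std_err


# Hypothetical prediction errors (prediction minus actual value).
balanced = [1.0, -1.0, 2.0, -2.0, 0.5, -0.5, 1.5, -1.5]
biased = [e + 3.5 for e in balanced]  # same spread, shifted up by 3.5

for name, errors in [("balanced", balanced), ("biased", biased)]:
    mean, t_stat = mean_error_check(errors)
    print(f"{name}: mean error = {mean:.2f}, t-statistic = {t_stat:.2f}")
```

To decide on significance, compare the absolute t-statistic with a critical value from the t-distribution with n - 1 degrees of freedom. A large value, as for the shifted errors here, points to systematic over- or underestimation.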

For business readers: you want your model to predict the right thing. If this condition is not met, you are systematically under- or overestimating. For example, if your error term is on average 3.5, you are on average overestimating by 3.5. That is not a good thing if you are predicting stock prices and making automatic trading decisions. So ask for the average of the error terms.

Ask for the average of the error terms to understand whether you are over- or underestimating. If the average is about 0, you are good.

Assumption 3: homoscedasticity

The variances of the disturbances exist and are equal. In other words, we expect the error in the model to be of similar size for all data points; this is sometimes referred to as homogeneity of variance. This only applies if the relationship that we are looking at is linear on all different levels.

For example, suppose you are looking at the relationship between income and spending on travel. The spread will be much smaller for lower incomes than for higher incomes, simply because a higher income gives more choice in what to spend. The result is that your model gets ‘pulled’ in the wrong direction (because it assumes the spread is equal everywhere and tries to reduce the error), and the higher-income data points influence the model much more than the lower-income data points.

In addition, this will influence the ability to make conclusions on the significance of your parameters.

The verdict: if you want to use your model for inference, test for homoscedasticity. If you find that the variance of your error terms isn’t constant → transform (one of) your variable(s) or use weighted least squares (WLS).
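As a rough illustration of what such a test looks at, the sketch below fits a straight line, sorts the residuals by the regressor, and compares the mean squared residual of the upper half with that of the lower half. It is a crude variance-ratio check on simulated data, not a formal test: the data, the noise levels, and the helper names `fit_line` and `variance_ratio` are all assumptions made for the example.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (single regressor)."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def variance_ratio(xs, ys):
    """Mean squared residual of the high-x half divided by the low-x half."""
    intercept, slope = fit_line(xs, ys)
    # Sort by x so the first half holds the low-x observations.
    residuals = [y - (intercept + slope * x) for x, y in sorted(zip(xs, ys))]
    half = len(residuals) // 2
    low = statistics.fmean([r * r for r in residuals[:half]])
    high = statistics.fmean([r * r for r in residuals[half:]])
    return high / low


rng = random.Random(0)
incomes = [rng.uniform(1.0, 10.0) for _ in range(1000)]
# Simulated travel spending: in the first series the noise grows with income,
# in the second series the noise is the same everywhere.
hetero = [0.2 * x + rng.gauss(0.0, 0.05 * x) for x in incomes]
homo = [0.2 * x + rng.gauss(0.0, 0.2) for x in incomes]

ratio_hetero = variance_ratio(incomes, hetero)
ratio_homo = variance_ratio(incomes, homo)
print(f"spread grows with income: ratio = {ratio_hetero:.2f}")
print(f"constant spread:          ratio = {ratio_homo:.2f}")
```

A ratio close to 1 is what homoscedasticity would look like; a ratio far above 1 means the errors spread out as the regressor grows, as in the income example.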

For business readers: you want the error terms to have homogeneous variance; otherwise, some of your data points may have too large an influence on the model and distort the picture for the rest of the data points. It is not that big of an issue for prediction: your model will still predict the right thing. So if prediction is all you care about, this is one to let slip.

If you just want to predict, let this one slip. If you want to infer on relations, better make a change.
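If you do want to make a change, weighted least squares is one option mentioned in the verdict. The sketch below shows the idea for a single regressor: every observation gets a weight, and noisier observations get smaller weights. The numbers are made up, the helper name `weighted_fit` is hypothetical, and the weights of 1 / income² rest on the assumption that the error standard deviation is proportional to income.

```python
def weighted_fit(xs, ys, weights):
    """Weighted least squares for y = a + b * x (single regressor)."""
    total = sum(weights)
    # Weighted means of x and y.
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical income (x) and travel spending (y); the values are made up.
incomes = [1.0, 2.0, 4.0, 6.0, 8.0, 10.0]
spending = [0.22, 0.37, 0.90, 1.05, 1.90, 1.70]

# Assumed variance structure: error standard deviation proportional to
# income, so each point is weighted by 1 / income**2.
weights = [1.0 / x ** 2 for x in incomes]

ols = weighted_fit(incomes, spending, [1.0] * len(incomes))
wls = weighted_fit(incomes, spending, weights)
print("OLS (equal weights): intercept = %.3f, slope = %.3f" % ols)
print("WLS (1 / income^2):  intercept = %.3f, slope = %.3f" % wls)
```

With equal weights the function reduces to ordinary least squares, which makes it easy to see how much the high-income points were pulling the fit.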

Assumption 4: no correlation

The error terms are uncorrelated. If they are correlated, there is still “explanatory” power available that the model has not captured, so there is potential to improve the model. Violating this assumption makes the standard errors unreliable, and when the correlation stems from a missing variable, the coefficients of your model can be biased because they “absorb” information from the error terms.

The verdict: if you want to use your model for inference, test for correlation in your error terms. If you find correlation → add more variables.
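One common way to look for correlation between consecutive error terms is the Durbin-Watson statistic. The sketch below computes it by hand on two made-up residual series; it only makes sense when the observations have a natural order, such as time, and it only picks up correlation between neighbouring errors.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals given in their time order."""
    # Sum of squared differences between consecutive residuals ...
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    # ... divided by the sum of squared residuals.
    den = sum(e * e for e in residuals)
    return num / den


# Made-up residuals: errors that stick to the same sign for a while
# (positive correlation) and errors that flip sign every step
# (negative correlation).
sticky = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]
alternating = [1.0, -1.0] * 5

print(durbin_watson(sticky))       # 0.5
print(durbin_watson(alternating))  # 3.6
```

The statistic lies between 0 and 4. Values near 2 suggest no first-order correlation, values towards 0 suggest positive correlation, and values towards 4 suggest negative correlation.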

For business readers: if you are interested in drawing conclusions about relationships, correlation in the error terms is a no-go. It also tells you there is potential to improve the model and generate better predictions.

If there is correlation present, you need to improve the model. Your predictions will get better and your inference will make sense.

Assumption 5: constant parameters

The parameters that you are estimating with the model are fixed and unknown numbers. For starters, if they were known, there would be no need for a model. We assume they are fixed because we want to rule out changes over time, where time refers to when the data was collected. If there are changes over time, we may need to include two different parameters or use only the most recent sample of the data.

An example of a violation: the data was collected by asking customers how much money they have paid into their pension fund, but the yearly maximum contribution was raised last year, so that suddenly a few thousand more can be paid in. In this case your parameters aren’t constant, and you need to account for that.

The verdict: can you safely say that the data at hand was produced by the same process, one that hasn’t changed over time? → Then you are good. If not → you will want to adjust your model and allow new variables to enter.
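When you know the date at which the process may have changed, one way to probe this is to fit the model separately before and after that date and compare it with a single pooled fit, which is the idea behind the Chow test. The sketch below does this on simulated data loosely modelled on the pension example; the incomes, contribution rates, noise level, and helper names are all assumptions made for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (single regressor)."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def sum_sq_resid(xs, ys):
    """Sum of squared residuals of the fitted line."""
    a, b = fit_line(xs, ys)
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


rng = random.Random(1)
# Simulated income (in thousands) and pension contributions, collected
# before and after a hypothetical rule change.
income_before = [rng.uniform(20.0, 100.0) for _ in range(200)]
income_after = [rng.uniform(20.0, 100.0) for _ in range(200)]
paid_before = [0.05 * x + rng.gauss(0.0, 0.3) for x in income_before]
paid_after = [0.08 * x + rng.gauss(0.0, 0.3) for x in income_after]

_, slope_before = fit_line(income_before, paid_before)
_, slope_after = fit_line(income_after, paid_after)

# Chow statistic: does splitting the sample reduce the residual sum of
# squares by more than chance would explain?
k = 2  # parameters per regression: intercept and slope
n = len(income_before) + len(income_after)
ssr_split = (sum_sq_resid(income_before, paid_before)
             + sum_sq_resid(income_after, paid_after))
ssr_pooled = sum_sq_resid(income_before + income_after,
                          paid_before + paid_after)
chow_f = ((ssr_pooled - ssr_split) / k) / (ssr_split / (n - 2 * k))

print(f"slope before = {slope_before:.3f}, slope after = {slope_after:.3f}")
print(f"Chow F-statistic = {chow_f:.1f}")
```

The statistic is compared with an F-distribution with k and n - 2k degrees of freedom. A large value says that one set of parameters does not describe both periods.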

For business readers: the key here is that the data was produced by the same process. Has the data collection changed over time? If it has, conclusions about the relationships between the different variables will not hold, and predictions on new data may be under- or overestimated.

Has the data collection changed over time? Then adjust the model; otherwise you risk over- or underestimating when new data comes in.

Assumption 6: linear model

The relationship between the different variables is a linear relationship. If this weren’t the case and the relationship were non-linear, you could not estimate a model that fits your data properly. Therefore, when you are creating a linear model, you need to assume linearity. Take, for example, the relationship between temperature and the number of people on the streets. This is not a linear relationship, and if you treated it that way, you would predict many people on the streets at 50 degrees Celsius.

The verdict: test for linearity (scatterplots do the trick), and if the relationship isn’t linear → transform your variables or go for a different model.
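To make the “transform your variables” option concrete, the sketch below uses a made-up, noise-free version of the temperature example in which the number of people peaks at 20 °C. Both the data and the choice of 20 °C as the comfortable temperature are assumptions for illustration. A straight line in temperature explains none of the variation, while a line in the transformed variable (squared distance from 20 °C) fits perfectly.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (single regressor)."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(xs, ys):
    """Share of the variance in y explained by a straight line in x."""
    a, b = fit_line(xs, ys)
    y_bar = statistics.fmean(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up, noise-free data: people on the street peak at 20 degrees Celsius.
temps = list(range(-10, 51, 5))
people = [1000 - (t - 20) ** 2 for t in temps]

# Transformed regressor: squared distance from the assumed comfortable 20 C.
discomfort = [(t - 20) ** 2 for t in temps]

print(f"R^2, line in temperature:          {r_squared(temps, people):.2f}")
print(f"R^2, line in transformed variable: {r_squared(discomfort, people):.2f}")
```

On real, noisy data a low R² alone does not prove non-linearity, so the scatterplot (or a plot of residuals against the regressor) remains the first thing to look at.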

For business readers: this type of model dictates the structure between what we try to predict and what goes into the model. If the structure isn’t met (in this case linearity), the model is meaningless. You can reason about whether the relationship is expected to be linear. If it’s not, and the test tells you the relationship isn’t linear → this is a no-go, and the model needs adjustment both for drawing conclusions about the relationship and for prediction.

If the model is linear, but the relationship isn’t, you can forget about inference as well as prediction.

Assumption 7: normality

This assumption says that the error terms are normally distributed. We want to verify this because we want to be able to run significance tests and define our confidence intervals.

The verdict: plot your error terms and check whether they look normal. If they are not normally distributed → check your linearity assumption again.
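Next to a plot, a numeric check can help. The sketch below computes the Jarque-Bera statistic, which combines the skewness and kurtosis of the error terms, for two simulated error series. The samples are generated only for illustration; real residuals would come from your fitted model.

```python
import random
import statistics


def jarque_bera(values):
    """Jarque-Bera statistic: n / 6 * (skew^2 + (kurtosis - 3)^2 / 4)."""
    n = len(values)
    mean = statistics.fmean(values)
    centered = [v - mean for v in values]
    # Second, third and fourth central moments.
    m2 = sum(c ** 2 for c in centered) / n
    m3 = sum(c ** 3 for c in centered) / n
    m4 = sum(c ** 4 for c in centered) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


rng = random.Random(0)
# Simulated error terms: one normal sample and one clearly skewed sample.
normal_errors = [rng.gauss(0.0, 1.0) for _ in range(2000)]
skewed_errors = [rng.expovariate(1.0) - 1.0 for _ in range(2000)]

print(f"normal errors: JB = {jarque_bera(normal_errors):.1f}")
print(f"skewed errors: JB = {jarque_bera(skewed_errors):.1f}")
```

Under normality the statistic approximately follows a chi-squared distribution with 2 degrees of freedom, so values above roughly 5.99 would be rejected at the 5% level. The skewed sample lands far beyond that.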

For business readers: this assumption allows us to say something about how sure we are about the estimated values in our model. If it is not met, we cannot draw conclusions about relationships, although we can still predict.

Without this assumption, we cannot say how sure we are about our estimated parameters. We can still predict on new data.

Inspired by: “Econometric Methods with Applications in Business and Economics” by Christiaan Heij, Paul de Boer, Philip Hans Franses, Teun Kloek and Herman K. van Dijk

About me: I am an Analytics Consultant and Director of Studies for “AI Management” at a local business school. I am on a mission to help organizations generate business value with AI and create an environment in which Data Scientists can thrive.
