Using Regression with Correlated Data


weixin_26713457


Original article: https://towardsdatascience.com/using-regression-with-correlated-data-5845a2eed3d2


While regression models are easy to run given their short, simple syntax, this accessibility also makes it easy to use regression inappropriately. These models rest on several key assumptions that must hold for their output to be valid, but your code will typically run whether or not those assumptions are met.


For linear regression (used with a continuous outcome), these assumptions are as follows:

Independence: All observations are independent of each other, and residuals are uncorrelated
Linearity: The relationship between X and Y is linear
Homoscedasticity: The variance of the residuals is constant across values of X
Normality: The residuals are normally distributed around the regression line

For logistic regression (used with a binary or ordinal categorical outcome), these assumptions are as follows:

Independence: All observations are independent of each other, and residuals are uncorrelated
Linearity in the logit: The relationship between X and the logit (log-odds) of Y is linear
The model is correctly specified, including a lack of multicollinearity among the predictors

In both kinds of simple regression model, independent observations are necessary to fit a valid model. If your data points are correlated, the independence assumption is violated. Fortunately, there are still ways to produce a valid regression model with correlated data.

Correlated Data

Correlation in data arises primarily from repeated measurements (e.g. two measurements are taken on each participant one week apart, so data points within an individual are not independent) or from clustering (e.g. a survey is conducted among students attending different schools, so data points from students within a given school are not independent).

The result is that the outcome has been measured at the level of an individual observation, but there is a second level, either the individual (in the case of multiple time points) or the cluster, within which data points can be correlated. Ignoring this correlation means that standard errors cannot be computed accurately, and in most cases they will be artificially low.
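To get a feel for how much too low a naive standard error can be, one common rule of thumb is the design effect, 1 + (m - 1) × ICC, where m is the cluster size and ICC is the intraclass correlation. The sketch below is only an approximation: it assumes equal-sized clusters and an estimate that varies at the cluster level (such as an overall mean), and the numbers are hypothetical rather than taken from the article.

```python
import math


def design_effect(cluster_size: float, icc: float) -> float:
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1.0) * icc


def corrected_se(naive_se: float, cluster_size: float, icc: float) -> float:
    """Inflate a naive standard error by the square root of the design effect."""
    return naive_se * math.sqrt(design_effect(cluster_size, icc))


# Hypothetical numbers: 20 students per school, ICC = 0.10, naive SE = 0.05
deff = design_effect(20, 0.10)            # 1 + 19 * 0.10 = 2.9
adjusted = corrected_se(0.05, 20, 0.10)   # 0.05 * sqrt(2.9)
print(f"design effect: {deff:.2f}, adjusted SE: {adjusted:.4f}")
# prints: design effect: 2.90, adjusted SE: 0.0851
```

Even a modest ICC of 0.10 inflates the standard error by about 70% here, which is why ignoring clustering tends to produce overconfident results.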

The best way to know whether your data are correlated is through familiarity with the data and the collection process that produced them. If you know that you have repeated measures from the same individuals, or data on participants who can be grouped into families or schools, you can assume that your data points are probably not independent. You can also investigate possible correlation by calculating the ICC (intraclass correlation coefficient) to determine how strongly data points are correlated within candidate groups, or by looking for correlation in your residuals.
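As an illustration of the ICC check, here is a minimal sketch of the one-way ANOVA estimator of the ICC. It assumes equal-sized groups, and the function name `icc_oneway` and the toy scores are hypothetical; statistical packages offer several ICC variants, so treat this as one simple option rather than the article's method.

```python
from statistics import mean


def icc_oneway(groups: list[list[float]]) -> float:
    """One-way ANOVA estimate of the ICC, assuming equal-sized groups."""
    k = len(groups[0])                       # observations per group
    if any(len(g) != k for g in groups):
        raise ValueError("this sketch assumes equal group sizes")
    n = len(groups)                          # number of groups
    grand_mean = mean(v for g in groups for v in g)
    group_means = [mean(g) for g in groups]
    # Between-group mean square
    msb = k * sum((gm - grand_mean) ** 2 for gm in group_means) / (n - 1)
    # Within-group mean square
    msw = sum(
        (v - gm) ** 2 for g, gm in zip(groups, group_means) for v in g
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Toy data: three schools, three student scores each
schools = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
icc = icc_oneway(schools)
print(f"ICC: {icc:.3f}")
# prints: ICC: 0.897
```

A value near 1 means most of the variation lies between groups, so observations within a group are highly correlated; a value near 0 suggests the grouping matters little.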

Regression Modeling with Correlated Data

As previously mentioned, simple regression produces inaccurate standard errors when the data are correlated, and therefore should not be used.

Instead, you want to use models that can account for the correlation present in your data. If the correlation is due to some grouping variable (e.g. school) or to repeated measures over time, you can choose between Generalized Estimating Equations and Multilevel Models. These modeling techniques can handle either binary or continuous outcome variables, so they can replace either logistic or linear regression when the data are correlated.

Generalized Estimating Equations

Generalized estimating equations (GEE) will give you beta estimates that are the same as or similar to those produced by simple regression, but with appropriate standard errors. GEE is particularly useful when you have repeated measures for the same individuals or units. This technique tends to work well when you have many small clusters, which is often the result of taking a few measurements on a large number of participants. GEE also lets you specify one of several working correlation structures, which can be a useful feature depending on your data.

Multilevel Modeling

Multilevel modeling (MLM) also provides appropriate standard errors when data points are not independent. It is typically the best approach when you are interested in relationships both within and between clusters, rather than simply wanting to account for the effect of correlation on the standard error estimates. MLM has the additional advantage of being able to handle more than two levels of nesting in the data (for example, students within classrooms within schools). The primary drawback of MLM is that it requires larger sample sizes within each cluster, so it may not work well when clusters are small.
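As a rough Python counterpart to what the article does in R, here is a random-intercept multilevel model fitted with `statsmodels`. The data are simulated and the names `school`, `hours` and `score` are hypothetical; the sketch covers only the simplest two-level case.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated (hypothetical) data: 30 schools with 25 students each.
rng = np.random.default_rng(7)
n_schools, n_students = 30, 25
school = np.repeat(np.arange(n_schools), n_students)
school_effect = rng.normal(0.0, 1.0, n_schools)    # between-school variation
hours = rng.uniform(0.0, 10.0, school.size)         # student-level predictor
score = (50.0 + 2.0 * hours + 3.0 * school_effect[school]
         + rng.normal(0.0, 5.0, school.size))
df = pd.DataFrame({"school": school, "hours": hours, "score": score})

# Random-intercept model: each school gets its own baseline score.
mlm_fit = smf.mixedlm("score ~ hours", data=df, groups=df["school"]).fit()

between_var = float(np.asarray(mlm_fit.cov_re)[0, 0])  # school-level variance
within_var = float(mlm_fit.scale)                       # student-level residual variance
icc = between_var / (between_var + within_var)

print(mlm_fit.fe_params)
print(f"between-school variance: {between_var:.2f}, "
      f"within-school variance: {within_var:.2f}, ICC: {icc:.2f}")
```

Unlike GEE, the fit reports the between-school and within-school variance components directly, which is what makes MLM attractive when the clustering itself is of interest.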

Both GEE and MLM are fairly easy to use in R. Below
