Variable Selection in Regression Analysis with a Large Feature Space

Collinearity of Independent Variables in Regression Analysis

Introduction

Performing multiple regression analysis from a large set of independent variables can be a challenging task. Identifying the best subset of regressors for a model involves optimizing against things like bias, multicollinearity, exogeneity/endogeneity, and threats to external validity. Such problems become difficult to understand and control in the presence of a large number of features. Professors will often tell you to “let theory be your guide” when going about feature selection, but that is not always so easy.

This blog considers the issue of multicollinearity and suggests a method of avoiding it. Proposed here is not a “solution” to collinear variables, nor is it a perfect way of identifying them. It is simply one measurement to take into consideration when comparing multiple subsets of variables.

The Problem

There are several ways of identifying the features that are causing problems in a model. The most common approach (and the basis of this post) is to calculate correlations between suspected collinear variables. While effective, it is important to acknowledge the shortcomings of this method. For instance, correlation coefficients are sensitive to sample size, and a bivariate correlation cannot detect two variables that are collinear only in the presence of additional variables. For these reasons, it is a good idea to consider other metrics and methods as well, some of which include the following: compare the significance of individual coefficients to the significance of the overall model; look for inflated standard errors; calculate variance inflation factors for the different features; conduct a principal components analysis; and yes, let theory be your guide.
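
To make these checks concrete, here is a minimal sketch of the correlation and variance-inflation-factor diagnostics using pandas and statsmodels. The column names and the generated data are placeholders for illustration, not variables from any particular model.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical regressors; x3 is deliberately built to be collinear with x1.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
X["x3"] = 0.8 * X["x1"] + 0.2 * rng.normal(size=200)

# Pairwise Pearson correlations between candidate regressors.
print(X.corr())

# Variance inflation factors, computed against a design matrix with an intercept.
design = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(design.values, i) for i in range(design.shape[1])],
    index=design.columns,
)
print(vif.drop("const"))
```

Large pairwise correlations or VIFs well above the usual rule-of-thumb thresholds (around 5 to 10) are warning signs worth investigating, not automatic grounds for dropping a variable.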

With all of this in mind, let us now consider a technique that employs a collection of transformed Pearson correlation coefficients in a multiple-criteria evaluation problem (see Multiple-Criteria Decision Analysis). The goal of the technique is to find a subset of independent variables where every pairwise correlation within the set is as low as possible, while simultaneously, each variable’s correlation with the dependent variable is as high as possible. We may represent the problem in the following way:
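
As one possible way to express this as a single score per candidate subset (a sketch only: the absolute-value transformation of the Pearson coefficients, the additive form, and the weight `alpha` are assumptions made for illustration, not necessarily the original formulation), we can reward correlation with the dependent variable and penalize pairwise correlation within the subset, in the spirit of a minimum-redundancy, maximum-relevance trade-off:

```python
from itertools import combinations

import pandas as pd


def subset_score(df: pd.DataFrame, subset: list, target: str, alpha: float = 1.0) -> float:
    """Score a candidate subset of regressors; higher is better.

    Rewards each regressor's absolute Pearson correlation with the target
    and penalizes the mean absolute pairwise correlation within the subset.
    """
    corr = df.corr()  # Pearson by default
    relevance = corr.loc[list(subset), target].abs().mean()
    if len(subset) > 1:
        redundancy = pd.Series(
            [corr.loc[a, b] for a, b in combinations(subset, 2)]
        ).abs().mean()
    else:
        redundancy = 0.0
    return relevance - alpha * redundancy


# Hypothetical usage: rank every two-variable subset of candidate regressors
# in a DataFrame `df` that also contains the dependent variable "y".
# candidates = [c for c in df.columns if c != "y"]
# ranked = sorted(
#     ((subset_score(df, list(s), "y"), s) for s in combinations(candidates, 2)),
#     reverse=True,
# )
```

Because the two criteria generally pull in opposite directions, the weight on the redundancy penalty is a modelling choice; inspecting the Pareto-efficient subsets rather than a single aggregated score is an equally reasonable way to handle the trade-off.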
