Computing the Variance Inflation Factor in Python: Variance Inflation Factor (VIF) explained, with a Python script...


Collinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset. To detect collinearity among variables, simply create a correlation matrix and look for pairs of variables whose correlation has a large absolute value. In R use the cor function; in Python this can be accomplished with numpy's corrcoef function.
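A minimal sketch of that pairwise check, on synthetic data rather than the loan dataset. The variables x1, x2, x3 and the 0.8 cutoff are illustrative assumptions, not values from this post.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)              # unrelated to the others

# np.corrcoef treats each row as a variable by default, so pass rowvar=False
# when the data has one column per variable.
data = np.column_stack([x1, x2, x3])
corr = np.corrcoef(data, rowvar=False)

# Collect every pair whose absolute correlation exceeds the chosen cutoff.
threshold = 0.8  # hypothetical cutoff
names = ["x1", "x2", "x3"]
high_pairs = [
    (names[i], names[j], round(float(corr[i, j]), 3))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(high_pairs)
```

Only the upper triangle of the matrix is scanned, so each pair is reported once and the diagonal (always 1) is ignored.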

Multicollinearity, on the other hand, is more troublesome to detect because it emerges when three or more variables that are highly correlated as a group are included within a model. To make matters worse, multicollinearity can emerge even when isolated pairs of variables are not collinear.
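The last point can be illustrated with a small synthetic example (an assumption-laden sketch, not taken from the loan data): x4 is built as a near-exact sum of three independent variables, so no single pairwise correlation looks alarming, yet x4 is almost perfectly predicted by the other three together.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1, x2, x3 = rng.normal(size=(3, n))
x4 = x1 + x2 + x3 + 0.05 * rng.normal(size=n)  # near-exact linear combination

X = np.column_stack([x1, x2, x3, x4])

# Largest pairwise correlation in absolute value (diagonal zeroed out).
corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.max(np.abs(corr - np.eye(4)))

# R^2 from regressing x4 on the other three predictors plus an intercept.
A = np.column_stack([np.ones(n), x1, x2, x3])
coef, *_ = np.linalg.lstsq(A, x4, rcond=None)
resid = x4 - A @ coef
r2 = 1 - resid.var() / x4.var()

print(f"max pairwise |r| = {max_pairwise:.2f}, R^2 of x4 on the rest = {r2:.4f}")
```

A correlation matrix alone would miss this structure, which is why a regression-based measure is needed.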

A common R function for testing regression assumptions, and multicollinearity specifically, is "VIF()". Unlike many statistical concepts, its formula is straightforward:

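$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}. $$

Here $R_j^2$ is the $R^2$ obtained by regressing predictor $j$ on all the other predictors in the model.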

The Variance Inflation Factor (VIF) is a measure of collinearity among predictor variables within a multiple regression. It is calculated as the ratio of the variance of a given coefficient (beta) in the full model to the variance of that same coefficient if its predictor were fit alone.
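As a rough by-hand illustration of the formula, the sketch below regresses each predictor on the remaining ones (with an intercept) and converts the resulting R^2 into a VIF. The function name vif_by_hand and the synthetic variables a, b, c are hypothetical, introduced only for this example.

```python
import numpy as np

def vif_by_hand(X):
    """Return 1 / (1 - R_j^2) for each column j of X (predictors only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression of predictor j on the others, with an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(2)
n = 2000
a = rng.normal(size=n)
b = rng.normal(size=n)            # independent of a and c
c = a + 0.1 * rng.normal(size=n)  # strongly tied to a

vifs = vif_by_hand(np.column_stack([a, b, c]))
print(vifs.round(1))
```

In this setup a and c should receive large VIFs while b stays close to 1, since b shares no information with the other two.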

Steps for Implementing VIF

1. Run a multiple regression.

2. Calculate the VIF factors.

3. Inspect the factors for each predictor variable. If a VIF is above roughly 5 to 10, multicollinearity is likely present and you should consider dropping the variable.
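To make the 5 to 10 rule of thumb concrete, the VIF formula can be inverted to show how much of a predictor's variance the other predictors must explain to reach a given VIF (here $R^2$ is the value from regressing that predictor on the others):

$$ R^2 = 1 - \frac{1}{\mathrm{VIF}}, \qquad \mathrm{VIF} = 5 \Rightarrow R^2 = 0.8, \qquad \mathrm{VIF} = 10 \Rightarrow R^2 = 0.9. $$

So a VIF of 5 means the remaining predictors already account for 80% of that predictor's variance, and a VIF of 10 means 90%.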

# Imports
import pandas as pd
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv('loan.csv')
# Missing values are dropped later, after subsetting the columns of interest
df = df.select_dtypes(include='number')  # drop non-numeric cols
df.head()

|   | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | int_rate | installment | annual_inc | dti | delinq_2yrs | ... | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599 | 5000.0 | 5000.0 | 4975.0 | 10.65 | 162.87 | 24000.0 | 27.65 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1077430 | 1314167 | 2500.0 | 2500.0 | 2500.0 | 15.27 | 59.83 | 30000.0 | 1.00 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1077175 | 1313524 | 2400.0 | 2400.0 | 2400.0 | 15.96 | 84.33 | 12252.0 | 8.72 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 1076863 | 1277178 | 10000.0 | 10000.0 | 10000.0 | 13.49 | 339.31 | 49200.0 | 20.00 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1075358 | 1311748 | 3000.0 | 3000.0 | 3000.0 | 12.69 | 67.79 | 80000.0 | 17.94 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

5 rows × 51 columns

df = df[['annual_inc', 'loan_amnt', 'funded_amnt', 'dti']].dropna()  # subset the dataframe

Step 1: Run a multiple regression

# Gather features: every column except the response, annual_inc
features = "+".join(df.columns.difference(["annual_inc"]))

# Get y and X dataframes based on this regression:
y, X = dmatrices('annual_inc ~' + features, df, return_type='dataframe')

Step 2: Calculate VIF Factors

# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

Step 3: Inspect VIF Factors

vif.round(1)

|   | VIF Factor | features |
|---|---|---|
| 0 | 5.1 | Intercept |
| 1 | 1.0 | dti |
| 2 | 678.4 | funded_amnt |
| 3 | 678.4 | loan_amnt |

As expected, the total funded amount for the loan and the amount of the loan have a high variance inflation factor because they "explain" the same variance within this dataset. We would need to discard one of these variables before moving on to model building, or risk building a model with high multicollinearity.
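One common heuristic for acting on such a result, which this post does not itself prescribe, is to drop the predictor with the largest VIF, recompute, and repeat until every VIF falls below a cutoff. The sketch below assumes a hypothetical helper prune_by_vif, a cutoff of 10, and synthetic stand-ins for dti, funded_amnt and loan_amnt (not the real loan.csv values). It uses the identity that the VIFs are the diagonal of the inverse of the predictors' correlation matrix.

```python
import numpy as np

def prune_by_vif(X, names, threshold=10.0):
    """Drop the highest-VIF column repeatedly until every VIF is at or below threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        # The diagonal of the inverse correlation matrix gives the VIFs.
        vifs = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names

# Synthetic stand-ins for the three predictors (assumed distributions).
rng = np.random.default_rng(3)
n = 1000
loan_amnt = rng.uniform(1_000, 35_000, size=n)
funded_amnt = loan_amnt - rng.uniform(0, 50, size=n)  # almost identical to loan_amnt
dti = rng.uniform(0, 40, size=n)

kept = prune_by_vif(
    np.column_stack([dti, funded_amnt, loan_amnt]),
    ["dti", "funded_amnt", "loan_amnt"],
)
print(kept)
```

Which of the two near-duplicate columns gets dropped depends on tiny numerical differences, so in practice the choice is better made on subject-matter grounds.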
