The Effect of Feature Multicollinearity on the Predictive Performance of Random Forests

Does feature collinearity affect the predictive performance of a random forest model?

Why do we care about feature collinearity?

Feature collinearity means that features in a dataset track each other too closely, i.e., they are highly correlated. Think of rainfall and the size of cloud cover, or a fabric's fiber content and its water absorbency.

In most machine learning models, collinearity is considered a bad thing: it can bias the model toward certain features and lose information, especially in multi-feature regression tasks.

In practice, however, collinearity has essentially no effect on a random forest's predictions. This post examines what collinearity does and does not affect in a random forest model.
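As a toy illustration first (a minimal sketch with made-up data, not part of the analysis below), two features derived from the same underlying signal are almost perfectly correlated:

import numpy as np

rng = np.random.default_rng(0)
rainfall = rng.gamma(shape=2.0, scale=10.0, size=1000)    # hypothetical 'rainfall' feature
cloud_size = 3.0 * rainfall + rng.normal(0.0, 5.0, 1000)  # noisy rescaling of the same signal

# Pearson correlation is close to 1: the two columns carry almost the same information
print(np.corrcoef(rainfall, cloud_size)[0, 1])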


# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

import warnings
warnings.filterwarnings('ignore')
# Show the current working directory (Jupyter magic)
%pwd
'D:\\python code\\9日常\\--------20210723特征共线对随机森林模型的影响--------\\TheDataVolcano-master'
# Load the data; the dataset is available at:
# https://catalog.data.gov/dataset/state-of-new-york-mortgage-agency-sonyma-loans-purchased-beginning-2004
df=pd.read_csv('./datasets/State_of_New_York_Mortgage_Agency.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28528 entries, 0 to 28527
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Bond Series              28528 non-null  object
 1   Original Loan Amount     28528 non-null  object
 2   Loan Purchase Date       28528 non-null  object
 3   Purchase Year            28528 non-null  int64 
 4   Original Loan To Value   28528 non-null  object
 5   Loan Type                28528 non-null  object
 6   SONYMA DPAL/CCAL Amount  21012 non-null  object
 7   Original Term            28528 non-null  int64 
 8   County                   28528 non-null  object
 9   FIPS Code                28528 non-null  int64 
 10  Number of Units          28528 non-null  object
 11  Property Type            28528 non-null  object
 12  Housing Type             28528 non-null  object
 13  Household Size           28528 non-null  int64 
dtypes: int64(4), object(10)
memory usage: 3.0+ MB
df.head(10)
| | Bond Series | Original Loan Amount | Loan Purchase Date | Purchase Year | Original Loan To Value | Loan Type | SONYMA DPAL/CCAL Amount | Original Term | County | FIPS Code | Number of Units | Property Type | Housing Type | Household Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Series 109/110 | $32470 | 01/02/2004 | 2004 | 97% | Conventional | $2933 | 360 | Monroe | 36055 | 1 Family | Detached | Existing | 1 |
| 1 | Series 109/110 | $48500 | 01/02/2004 | 2004 | 97% | Conventional | $3435 | 360 | Genesee | 36037 | 1 Family | Detached | Existing | 4 |
| 2 | Series 109/110 | $49470 | 01/02/2004 | 2004 | 97% | Conventional | $4996 | 360 | Monroe | 36055 | 1 Family | Detached | Existing | 3 |
| 3 | Series 109/110 | $58200 | 01/02/2004 | 2004 | 97% | Conventional | $4170 | 360 | Erie | 36029 | 1 Family | Detached | Existing | 2 |
| 4 | Series 109/110 | $64990 | 01/02/2004 | 2004 | 97% | Conventional | $4940 | 360 | Erie | 36029 | 1 Family | Detached | Existing | 3 |
| 5 | Series 109/110 | $64990 | 01/02/2004 | 2004 | 97% | Conventional | $4772 | 360 | Schenectady | 36093 | 1 Family | Detached | Existing | 1 |
| 6 | Series 109/110 | $67900 | 01/02/2004 | 2004 | 97% | Conventional | $5000 | 360 | Orleans | 36073 | 1 Family | Detached | Existing | 3 |
| 7 | Series 109/110 | $67900 | 01/02/2004 | 2004 | 97% | Conventional | $4845 | 360 | Wayne | 36117 | 1 Family | Detached | Existing | 2 |
| 8 | Series 109/110 | $72775 | 01/02/2004 | 2004 | 97% | Conventional | $5000 | 360 | Monroe | 36055 | 1 Family | Detached | Existing | 2 |
| 9 | Series 109/110 | $77115 | 01/02/2004 | 2004 | 97% | Conventional | $5000 | 360 | Broome | 36007 | 1 Family | Detached | Existing | 2 |
df.columns
Index(['Bond Series', 'Original Loan Amount', 'Loan Purchase Date ',
       'Purchase Year', 'Original Loan To Value', 'Loan Type ',
       'SONYMA DPAL/CCAL Amount', 'Original Term', 'County', 'FIPS Code',
       'Number of Units', 'Property Type', 'Housing Type', 'Household Size '],
      dtype='object')
dfmod = df[['Original Loan Amount', 'Purchase Year', 'Original Loan To Value', 'SONYMA DPAL/CCAL Amount', 'Number of Units', \
'Household Size ', 'Property Type', 'County', 'Housing Type', 'Bond Series', 'Original Term']]

# turn off the warning on the slice operation we do below.
# factorize is a special case because it returns a tuple, sigh
# https://stackoverflow.com/questions/45080400/dealing-with-pandas-settingwithcopywarning-without-indexer
pd.options.mode.chained_assignment = None

# factorize maps string categories like 'Condo' and 'Detached' to numeric codes (0, 1, ...)
# so the model can handle them; stacking first makes the four columns share one code table
stacked = dfmod[['Property Type', 'County', 'Housing Type', 'Bond Series']].stack()
dfmod[['Property Type', 'County', 'Housing Type', 'Bond Series']] = pd.Series(stacked.factorize()[0], \
                                                                              index=stacked.index).unstack()

# use regex replace to fix columns that mix numeric and text values ('$32470', '97%', '1 Family')
dfmod = dfmod.replace(r'[\$,]', '', regex=True)
dfmod = dfmod.replace(r'%', '', regex=True)
dfmod = dfmod.replace('Family', '', regex=True)

# convert everything to float
dfmod = dfmod.astype(float)
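For intuition, factorize() simply assigns each distinct label an integer code; because the four columns are stacked into one Series first, they share a single code table. A minimal sketch with toy labels (not from this dataset):

import pandas as pd

s = pd.Series(['Detached', 'Condo', 'Detached', 'Townhouse'])
codes, uniques = pd.factorize(s)
print(codes)    # [0 1 0 2]
print(uniques)  # Index(['Detached', 'Condo', 'Townhouse'], dtype='object')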

Now let's see which features in the processed dataset are correlated with each other.

dfmod.columns
Index(['Original Loan Amount', 'Purchase Year', 'Original Loan To Value',
       'SONYMA DPAL/CCAL Amount', 'Number of Units', 'Household Size ',
       'Property Type', 'County', 'Housing Type', 'Bond Series',
       'Original Term'],
      dtype='object')
dfmod.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28528 entries, 0 to 28527
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Original Loan Amount     28528 non-null  float64
 1   Purchase Year            28528 non-null  float64
 2   Original Loan To Value   28528 non-null  float64
 3   SONYMA DPAL/CCAL Amount  21012 non-null  float64
 4   Number of Units          28528 non-null  float64
 5   Household Size           28528 non-null  float64
 6   Property Type            28528 non-null  float64
 7   County                   28528 non-null  float64
 8   Housing Type             28528 non-null  float64
 9   Bond Series              28528 non-null  float64
 10  Original Term            28528 non-null  float64
dtypes: float64(11)
memory usage: 2.4 MB
dfmod.head(10)
| | Original Loan Amount | Purchase Year | Original Loan To Value | SONYMA DPAL/CCAL Amount | Number of Units | Household Size | Property Type | County | Housing Type | Bond Series | Original Term |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32470.0 | 2004.0 | 97.0 | 2933.0 | 1.0 | 1.0 | 0.0 | 1.0 | 2.0 | 3.0 | 360.0 |
| 1 | 48500.0 | 2004.0 | 97.0 | 3435.0 | 1.0 | 4.0 | 0.0 | 4.0 | 2.0 | 3.0 | 360.0 |
| 2 | 49470.0 | 2004.0 | 97.0 | 4996.0 | 1.0 | 3.0 | 0.0 | 1.0 | 2.0 | 3.0 | 360.0 |
| 3 | 58200.0 | 2004.0 | 97.0 | 4170.0 | 1.0 | 2.0 | 0.0 | 5.0 | 2.0 | 3.0 | 360.0 |
| 4 | 64990.0 | 2004.0 | 97.0 | 4940.0 | 1.0 | 3.0 | 0.0 | 5.0 | 2.0 | 3.0 | 360.0 |
| 5 | 64990.0 | 2004.0 | 97.0 | 4772.0 | 1.0 | 1.0 | 0.0 | 6.0 | 2.0 | 3.0 | 360.0 |
| 6 | 67900.0 | 2004.0 | 97.0 | 5000.0 | 1.0 | 3.0 | 0.0 | 7.0 | 2.0 | 3.0 | 360.0 |
| 7 | 67900.0 | 2004.0 | 97.0 | 4845.0 | 1.0 | 2.0 | 0.0 | 8.0 | 2.0 | 3.0 | 360.0 |
| 8 | 72775.0 | 2004.0 | 97.0 | 5000.0 | 1.0 | 2.0 | 0.0 | 1.0 | 2.0 | 3.0 | 360.0 |
| 9 | 77115.0 | 2004.0 | 97.0 | 5000.0 | 1.0 | 2.0 | 0.0 | 9.0 | 2.0 | 3.0 | 360.0 |
# Plot a heatmap of the feature correlations
from mlxtend.plotting import heatmap

cols = ['Original Loan Amount', 'Purchase Year', 'Original Loan To Value',\
       'SONYMA DPAL/CCAL Amount', 'Number of Units', 'Household Size ',\
       'Property Type', 'County', 'Housing Type', 'Bond Series',\
       'Original Term']
cm = np.corrcoef(dfmod[cols].values.T)
# the nan entries in the figure below appear because 'SONYMA DPAL/CCAL Amount'
# contains nulls, which np.corrcoef propagates
hm = heatmap(cm, row_names=cols, column_names=cols, figsize=(12, 12))

# save the figure
plt.savefig('./heatmaps.png', dpi=300)
plt.show()

[Figure: feature correlation heatmap (heatmaps.png); the nan row/column corresponds to 'SONYMA DPAL/CCAL Amount']
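If mlxtend is unavailable, a similar plot can be drawn with seaborn (a sketch, assuming seaborn is installed). Note that pandas' .corr() drops null pairs automatically, so this version has no nan cells:

import seaborn as sns

sns.heatmap(dfmod[cols].corr(), annot=True, fmt='.2f', square=True)
plt.show()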

# test the correlations numerically (.corr() ignores nulls pairwise)
corrDF = dfmod.corr()
corrDF

| | Original Loan Amount | Purchase Year | Original Loan To Value | SONYMA DPAL/CCAL Amount | Number of Units | Household Size | Property Type | County | Housing Type | Bond Series | Original Term |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Original Loan Amount | 1.000000 | 0.337831 | -0.056902 | 0.662054 | 0.112723 | 0.238369 | 0.101085 | 0.232890 | 0.124133 | 0.325947 | 0.184459 |
| Purchase Year | 0.337831 | 1.000000 | -0.152347 | -0.062682 | -0.005365 | 0.073343 | 0.149763 | 0.105100 | 0.113273 | 0.922574 | 0.058512 |
| Original Loan To Value | -0.056902 | -0.152347 | 1.000000 | -0.091755 | 0.028671 | -0.033281 | -0.294189 | -0.189966 | -0.294904 | -0.167013 | -0.002098 |
| SONYMA DPAL/CCAL Amount | 0.662054 | -0.062682 | -0.091755 | 1.000000 | 0.050282 | 0.202176 | 0.025931 | 0.160047 | 0.152185 | -0.078872 | 0.199689 |
| Number of Units | 0.112723 | -0.005365 | 0.028671 | 0.050282 | 1.000000 | -0.004223 | -0.003898 | -0.027416 | 0.013539 | -0.006489 | -0.002910 |
| Household Size | 0.238369 | 0.073343 | -0.033281 | 0.202176 | -0.004223 | 1.000000 | -0.043792 | 0.107912 | 0.085088 | 0.073733 | 0.076269 |
| Property Type | 0.101085 | 0.149763 | -0.294189 | 0.025931 | -0.003898 | -0.043792 | 1.000000 | 0.197686 | 0.224163 | 0.158472 | 0.030414 |
| County | 0.232890 | 0.105100 | -0.189966 | 0.160047 | -0.027416 | 0.107912 | 0.197686 | 1.000000 | 0.174262 | 0.115525 | 0.050321 |
| Housing Type | 0.124133 | 0.113273 | -0.294904 | 0.152185 | 0.013539 | 0.085088 | 0.224163 | 0.174262 | 1.000000 | 0.134284 | 0.046703 |
| Bond Series | 0.325947 | 0.922574 | -0.167013 | -0.078872 | -0.006489 | 0.073733 | 0.158472 | 0.115525 | 0.134284 | 1.000000 | 0.054699 |
| Original Term | 0.184459 | 0.058512 | -0.002098 | 0.199689 | -0.002910 | 0.076269 | 0.030414 | 0.050321 | 0.046703 | 0.054699 | 1.000000 |

The table shows that most feature pairs are only weakly correlated. The two notable exceptions are 'Purchase Year' vs. 'Bond Series' at 0.92, and 'Original Loan Amount' vs. 'SONYMA DPAL/CCAL Amount' at 0.66.
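Pairwise correlation can miss collinearity that involves several features at once; the variance inflation factor (VIF) is a standard check for that. A sketch using statsmodels (assuming it is installed; values above roughly 5-10 are usually taken as a warning sign):

from statsmodels.stats.outliers_influence import variance_inflation_factor

X = dfmod.dropna()  # VIF cannot handle the nulls in 'SONYMA DPAL/CCAL Amount'
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))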

Let's fabricate some strongly correlated fake data.

We generate a new column, 'Grandmas Loan Agency', that is highly correlated with 'SONYMA DPAL/CCAL Amount'. The data look like this:

# scale 'SONYMA DPAL/CCAL Amount' by a factor drifting from 0.9 to 1.1,
# giving a nearly identical (~0.99 correlated) fake feature
randoms = np.linspace(0.9, 1.1, len(dfmod))
dfmod['Grandmas Loan Agency'] = dfmod['SONYMA DPAL/CCAL Amount']*randoms
corrDF=dfmod.corr()
corrDF
| | Original Loan Amount | Purchase Year | Original Loan To Value | SONYMA DPAL/CCAL Amount | Number of Units | Household Size | Property Type | County | Housing Type | Bond Series | Original Term | Grandmas Loan Agency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Original Loan Amount | 1.000000 | 0.337831 | -0.056902 | 0.662054 | 0.112723 | 0.238369 | 0.101085 | 0.232890 | 0.124133 | 0.325947 | 0.184459 | 0.683658 |
| Purchase Year | 0.337831 | 1.000000 | -0.152347 | -0.062682 | -0.005365 | 0.073343 | 0.149763 | 0.105100 | 0.113273 | 0.922574 | 0.058512 | 0.030616 |
| Original Loan To Value | -0.056902 | -0.152347 | 1.000000 | -0.091755 | 0.028671 | -0.033281 | -0.294189 | -0.189966 | -0.294904 | -0.167013 | -0.002098 | -0.105809 |
| SONYMA DPAL/CCAL Amount | 0.662054 | -0.062682 | -0.091755 | 1.000000 | 0.050282 | 0.202176 | 0.025931 | 0.160047 | 0.152185 | -0.078872 | 0.199689 | 0.993475 |
| Number of Units | 0.112723 | -0.005365 | 0.028671 | 0.050282 | 1.000000 | -0.004223 | -0.003898 | -0.027416 | 0.013539 | -0.006489 | -0.002910 | 0.048050 |
| Household Size | 0.238369 | 0.073343 | -0.033281 | 0.202176 | -0.004223 | 1.000000 | -0.043792 | 0.107912 | 0.085088 | 0.073733 | 0.076269 | 0.207818 |
| Property Type | 0.101085 | 0.149763 | -0.294189 | 0.025931 | -0.003898 | -0.043792 | 1.000000 | 0.197686 | 0.224163 | 0.158472 | 0.030414 | 0.038477 |
| County | 0.232890 | 0.105100 | -0.189966 | 0.160047 | -0.027416 | 0.107912 | 0.197686 | 1.000000 | 0.174262 | 0.115525 | 0.050321 | 0.167676 |
| Housing Type | 0.124133 | 0.113273 | -0.294904 | 0.152185 | 0.013539 | 0.085088 | 0.224163 | 0.174262 | 1.000000 | 0.134284 | 0.046703 | 0.167975 |
| Bond Series | 0.325947 | 0.922574 | -0.167013 | -0.078872 | -0.006489 | 0.073733 | 0.158472 | 0.115525 | 0.134284 | 1.000000 | 0.054699 | 0.009896 |
| Original Term | 0.184459 | 0.058512 | -0.002098 | 0.199689 | -0.002910 | 0.076269 | 0.030414 | 0.050321 | 0.046703 | 0.054699 | 1.000000 | 0.207590 |
| Grandmas Loan Agency | 0.683658 | 0.030616 | -0.105809 | 0.993475 | 0.048050 | 0.207818 | 0.038477 | 0.167676 | 0.167975 | 0.009896 | 0.207590 | 1.000000 |

Build a random forest model.

from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.model_selection import train_test_split

def ourModel(data, result):
    # inputs:
    #   data   = pandas DataFrame of features (X)
    #   result = column holding the desired target (y)

    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        data, result, test_size=0.25, random_state=1)

    # set up and fit the model; oob_score=True also exposes the out-of-bag R^2 as
    # clf.oob_score_ (the forest itself is not seeded, so scores vary slightly between runs)
    clf = RandomForestRegressor(n_estimators=100, n_jobs=4, oob_score=True)
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    # sklearn metrics expect (y_true, y_pred) in that order
    print('r2: ' + str(metrics.r2_score(y_test, predictions)))
    print('mse: ' + str(metrics.mean_squared_error(y_test, predictions)))

    # feature importances, sorted from least to most important
    importances = clf.feature_importances_
    indices = np.argsort(importances)
    fp = zip(data.columns.values[indices], importances[indices])

    return fp
    

Run the function defined above:

# drop the rows with nulls (in 'SONYMA DPAL/CCAL Amount') so the model can fit
dfmod = dfmod.dropna()
result = dfmod['Original Loan Amount']
data_fake = dfmod.drop(['Original Loan Amount'], axis=1)

print('WITH MADE UP DATA')
fp_fake=ourModel(data_fake, result)

data_nofake=dfmod.drop(['Original Loan Amount', 'Grandmas Loan Agency'], axis=1)
print("\nWITHOUT MADE UP DATA")
fp_nofake=ourModel(data_nofake, result)
WITH MADE UP DATA
r2: 0.8805216008148813
mse: 637577470.5899562

WITHOUT MADE UP DATA
r2: 0.8784191345199206
mse: 645873903.6564941

Both runs score essentially the same (R² ≈ 0.88), so the fake collinear feature did not hurt prediction. Now let's engineer a few extra features to try to improve the model further.

# let's add some financial data
# our purchase years range from 2004 to 2016; let's get the average 30-year mortgage rate in those years
# googled and found at: http://www.freddiemac.com/pmms/pmms30.html
years = np.linspace(2004., 2016., 13)
mort30 = np.array([5.84, 5.87, 6.41, 6.34, 6.03, 5.04, 4.69, 4.45, 3.66, 3.98, 4.17, 3.85, 3.65])
# look up each loan's purchase year in the rate table
dfmod['mort'] = [mort30[years == dfmod['Purchase Year'].iloc[item]][0] for item in range(len(dfmod['Purchase Year']))]


# what else can we add? Maybe how many houses were purchased in that year 
# from https://www.statista.com/statistics/219963/number-of-us-house-sales/
housesBought=np.array([1203, 1283, 1051, 776, 485, 375, 323, 306, 368, 429, 437, 501, 560])*1000.
dfmod['housesBought']=[housesBought[years==dfmod['Purchase Year'].iloc[item]][0] for item in range(len(dfmod['Purchase Year']))]

#  and just because we don't want all our new data depending on year, let's do one about
#  expected wealth by family size in NY
#  source: https://www.justice.gov/ust/eo/bapcpa/20130501/bci_data/median_income_table.htm
#  assume > 4 is = 4

# make a crude "wealthy vs. poor" flag: 1 for household sizes of 3-4, else 0
dfmod['income'] = [0 if dfmod['Household Size '].iloc[x] < 3 or dfmod['Household Size '].iloc[x]\
    > 4 else 1 for x in range(len(dfmod['Household Size '])) ]

# run the model again on the augmented data
dfmod = dfmod.dropna()
result = dfmod['Original Loan Amount']
data_fake = dfmod.drop(['Original Loan Amount'], axis=1)

print('WITH MADE UP DATA, Round 2!')
fp_fake = ourModel(data_fake, result)

# rebuild the no-fake feature set so it also picks up the new columns
data_nofake = dfmod.drop(['Original Loan Amount', 'Grandmas Loan Agency'], axis=1)
print("\nWITHOUT MADE UP DATA, Round 2!")
fp_nofake = ourModel(data_nofake, result)

# feature ranking for importances
print('\nFeatures in order of importance for fake:')
print(list(fp_fake))

print('\nFeatures in order of importance for NOT fake:')
print(list(fp_nofake))

WITH MADE UP DATA, Round 2!
r2: 0.8805119731776188
mse: 636127397.198097

WITHOUT MADE UP DATA, Round 2!
r2: 0.87777395649925
mse: 650974675.1996832

Features in order of importance for fake:
[('Original Term', 0.001193970255686264), ('income', 0.001962303252954665), ('housesBought', 0.0032933784659805597), ('Number of Units', 0.004629807343641515), ('Housing Type', 0.005638302615442727), ('Household Size ', 0.00818667858229475), ('Purchase Year', 0.010329245640254558), ('mort', 0.01719017704279596), ('Bond Series', 0.01786888405078895), ('Property Type', 0.02191737837312151), ('Original Loan To Value', 0.06067438107071076), ('County', 0.060777486720751416), ('SONYMA DPAL/CCAL Amount', 0.08007650880918302), ('Grandmas Loan Agency', 0.7062614977763934)]

Features in order of importance for NOT fake:
[('Original Term', 0.001331898105323796), ('Number of Units', 0.004787708555639128), ('Housing Type', 0.005980303169527446), ('Household Size ', 0.010415186933312198), ('Property Type', 0.022150192124642493), ('Bond Series', 0.02455088805708598), ('Purchase Year', 0.039911871359296906), ('Original Loan To Value', 0.062332516902338694), ('County', 0.064640597446016), ('SONYMA DPAL/CCAL Amount', 0.7638988373468174)]

Notice that when 'Grandmas Loan Agency' is present it takes a high importance score (about 0.71), while the importance of 'SONYMA DPAL/CCAL Amount' drops to roughly 0.08;

once 'Grandmas Loan Agency' is removed, the importance of 'SONYMA DPAL/CCAL Amount' climbs back to about 0.76. To summarize:

A random forest's predictive performance is not hurt by multicollinearity,

but interpretability is. Random forests report feature importances, and collinear features end up sharing, and partially cancelling, each other's importance scores, which distorts our picture of which features actually matter.

An intuitive way to see this: collinear features do not reduce the predictive power of decision trees or random forests.

The most extreme case of multicollinearity is two identical features, A and B. Once a tree has split on feature A, it gains nothing from splitting on feature B, because B adds no new information; likewise, if a tree picks B first, A goes unused. A quick demonstration follows.
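A minimal sketch of this extreme case on synthetic data (illustrative only, not the mortgage dataset): duplicating a feature leaves the test R² essentially unchanged, while the importance the feature used to have is split between the two copies:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=2000, n_features=5, noise=10.0, random_state=0)
X_dup = np.hstack([X, X[:, [0]]])  # column 5 is an exact copy of column 0

for name, Xi in [('original', X), ('with duplicate', X_dup)]:
    X_tr, X_te, y_tr, y_te = train_test_split(Xi, y, random_state=1)
    rf = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_tr, y_tr)
    print(name,
          'r2:', round(r2_score(y_te, rf.predict(X_te)), 3),
          'importances:', np.round(rf.feature_importances_, 3))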

So tree-based models' predictions are unaffected by multicollinearity, but their feature-importance rankings are.
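The same dilution shows up in permutation importance (sklearn.inspection.permutation_importance): permuting one of two collinear features barely hurts the model, because the trees can recover the signal from its twin. A common remedy, discussed in the scikit-learn docs, is to cluster highly correlated features and keep one per cluster before interpreting importances. A sketch, reusing rf, X_te, and y_te from the duplicate-feature example above:

from sklearn.inspection import permutation_importance

perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
# both copies of the duplicated feature score much lower than a unique feature would
print(np.round(perm.importances_mean, 3))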
