(Machine Learning) Filling Missing Values with Random Forest: The Approach and a Line-by-Line Code Walkthrough

克里斯大炮

Last modified: 2024-03-17 11:17:58

Category: Machine Learning. Tags: machine learning, python, data analysis

First published: 2020-11-13 15:10:11

Article link: https://blog.csdn.net/m0_46177963/article/details/109673426

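Filling Missing Values with Random Forest

1. Filling missing values with 0 and the mean
2. The idea behind filling missing values with random forest
3. Line-by-line walkthrough of the random forest imputation code
4. Results of random forest imputation
5. Complete code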

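Data collected in the real world is almost never complete; it usually contains some missing values. One important application of random forest regression is filling in those missing values. In sklearn, the class sklearn.impute.SimpleImputer is dedicated to imputation and makes it easy to fill the data with each column's mean, median, or most frequent value.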
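
To make the three strategies concrete, here is a minimal sketch on a tiny made-up matrix (the array `toy` and the dictionary `filled` are illustrative names, not part of the original article). It fits one SimpleImputer per strategy and prints the fill value learned for each column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix (made up for illustration): two feature columns, NaN marks a missing entry.
toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 10.0],
    [9.0, 40.0],
])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled[strategy] = imputer.fit_transform(toy)
    # statistics_ holds the per-column value used to replace NaN
    print(strategy, imputer.statistics_)
```

Each statistic is computed per column from the observed entries only, so one column's fill value never depends on another column.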

1. Filling missing values with 0 and the mean

Before using random forest, let's first fill the missing values with sklearn.impute.SimpleImputer, sklearn's dedicated imputation class.
Preparation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer

boston = load_boston()
x_full = boston.data    # x_full.shape == (506, 13): 506 samples, 13 features each
y_full = boston.target
n_samples = x_full.shape[0]     # 506
n_features = x_full.shape[1]    # 13

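The Boston housing dataset is complete, so we first turn it into a dataset with missing values and then fill them in. To do that we have to choose the proportion of missing data. Here we assume 50%, which means requesting 506 × 13 × 0.5 = 3289 missing entries in total.
How do we make entries go missing at random? We create 3289 row indices in the range 0 to 505 and 3289 column indices in the range 0 to 12 (each missing entry needs one row index and one column index). With these indices we can set up to 3289 positions in the data to NaN. The indices are drawn at random within the given ranges, so the positions are random too, and because each index is drawn independently, the same (row, column) pair can come up more than once.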
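
The index trick is easier to see on a small array. Below is a minimal sketch on a made-up 6 × 4 matrix (the name `toy` and the sizes are assumptions chosen for illustration). It draws one row index and one column index per requested position and blanks the paired positions with NumPy fancy indexing.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy stand-in for the feature matrix: 6 rows, 4 columns (sizes chosen for illustration).
toy = np.arange(24, dtype=float).reshape(6, 4)
n_rows, n_cols = toy.shape
n_missing = int(np.floor(n_rows * n_cols * 0.5))  # 12 positions requested

# One row index and one column index per position; the upper bound is exclusive.
rows = rng.randint(0, n_rows, n_missing)
cols = rng.randint(0, n_cols, n_missing)

toy_missing = toy.copy()
# Fancy indexing pairs the two arrays element-wise: (rows[i], cols[i]) is set to NaN.
toy_missing[rows, cols] = np.nan

print(toy_missing)
print("NaN count:", int(np.isnan(toy_missing).sum()))
```

Working on a copy keeps the complete matrix around, which matters later when the imputed values are compared against the truth.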

rng = np.random.RandomState(0)
missing_rate = 0.5
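# np.floor rounds down and returns a float such as 3289.0, so we convert it to int to drop the .0.
# Here the product is already a whole number even without np.floor and int, but we keep them for robustness: with a different missing rate the product may no longer be an integer.
n_missing_samples = int(np.floor(n_samples * n_features * missing_rate))    # 3289

# randint(low, high, n) draws n integers from low (inclusive) to high (exclusive)
missing_features = rng.randint(0, n_features, n_missing_samples)  # column indices, 0 to 12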
print(missing_features)
[12  5  0 ... 11  0  2]

missing_samples = rng.randint(0, n_samples, n_missing_samples)    # row indices, 0 to 505
print(missing_samples)
[150 125  28 ... 132 456 402]

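We are drawing 3289 indices, far more than our 506 samples, so we use randint, which samples with replacement. If the number of indices we needed were smaller than the sample size of 506, we could use np.random.choice instead: with replace=False, choice draws values without repetition, which spreads the missing entries out and keeps them from concentrating in a few rows. We do not use choice here, because the number of indices far exceeds 506.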

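# Draw n_missing_samples values from 0 to n_samples - 1; replace=False means no repeats.
# Only valid when n_missing_samples <= n_samples. With 3289 > 506 it raises ValueError, so it is left commented out.
# missing_samples = rng.choice(n_samples, n_missing_samples, replace=False)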
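
The difference between the two samplers can be checked directly. The sketch below is only an illustration (the sample size of 300 and the variable names are assumptions): randint may repeat values, choice with replace=False never does, and asking choice for more distinct values than exist raises a ValueError.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 506

# randint samples with replacement, so duplicates are possible.
with_repeats = rng.randint(0, n_samples, 300)

# choice with replace=False returns 300 distinct row indices.
no_repeats = rng.choice(n_samples, 300, replace=False)

print(len(np.unique(with_repeats)), len(np.unique(no_repeats)))

# Asking for more distinct values than the population holds fails.
try:
    rng.choice(n_samples, 3289, replace=False)
    too_many_ok = True
except ValueError as err:
    too_many_ok = False
    print("ValueError:", err)
```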

Create the dataset with missing values:

x_missing = x_full.copy()
y_missing = y_full.copy()

x_missing[missing_samples , missing_features] = np.nan
x_missing = pd.DataFrame(x_missing)
print(x_missing)
          0     1      2    3      4   ...   8      9     10      11    12
0        NaN  18.0    NaN  NaN  0.538  ...  1.0  296.0   NaN     NaN  4.98
1    0.02731   0.0    NaN  0.0  0.469  ...  2.0    NaN   NaN  396.90  9.14
2    0.02729   NaN   7.07  0.0    NaN  ...  2.0  242.0   NaN     NaN   NaN
3        NaN   NaN    NaN  0.0  0.458  ...  NaN  222.0  18.7     NaN   NaN
4        NaN   0.0   2.18  0.0    NaN  ...  NaN    NaN  18.7     NaN  5.33
..       ...   ...    ...  ...    ...  ...  ...    ...   ...     ...   ...
501      NaN   NaN    NaN  0.0  0.573  ...  1.0    NaN  21.0     NaN  9.67
502  0.04527   0.0  11.93  0.0  0.573  ...  1.0  273.0   NaN  396.90  9.08
503      NaN   NaN  11.93  NaN  0.573  ...  NaN    NaN  21.0     NaN  5.64
504  0.10959   0.0  11.93  NaN  0.573  ...  1.0    NaN  21.0  393.45  6.48
505  0.04741   0.0  11.93  0.0  0.573  ...  1.0    NaN   NaN  396.90  7.88
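
One detail worth checking: the row and column indices are drawn independently and with replacement, so some (row, column) pairs repeat and the real share of NaN ends up below the requested 50%. In expectation the share is 1 - (1 - 1/6578)^3289, roughly 1 - e^(-0.5), or about 39%. The sketch below repeats the sampling on a placeholder matrix of the same shape (`x_demo` is a hypothetical stand-in, used so that the check does not depend on load_boston) and counts the cells actually blanked.

```python
import numpy as np

n_samples, n_features, missing_rate = 506, 13, 0.5
n_missing_samples = int(np.floor(n_samples * n_features * missing_rate))  # 3289

rng = np.random.RandomState(0)
missing_features = rng.randint(0, n_features, n_missing_samples)
missing_samples = rng.randint(0, n_samples, n_missing_samples)

# Placeholder matrix with the same shape as the data; the values do not matter here.
x_demo = np.zeros((n_samples, n_features))
x_demo[missing_samples, missing_features] = np.nan

# Cells actually set to NaN versus distinct (row, column) pairs that were drawn.
n_nan = int(np.isnan(x_demo).sum())
n_unique_pairs = len(set(zip(missing_samples.tolist(), missing_features.tolist())))

print("requested:", n_missing_samples)
print("actual NaN cells:", n_nan)
print("actual missing rate:", n_nan / x_demo.size)
```

If an exact 50% is needed, one option is to sample 3289 distinct flat positions out of the 6578 cells with choice and replace=False, then convert them to row and column indices.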

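Note that y_missing is left untouched: features may be missing, but labels may not. Without labels the problem would turn into unsupervised learning. When we talk about filling missing values, we mean the missing entries of the feature matrix.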

Filling with the mean:

imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")    # instantiate
# fit on x_missing to learn each column's mean, then return the data with the NaNs filled in
x_missing_mean = imp_mean.fit_transform(x_missing)
print(pd.DataFrame(x_missing_mean))
           0          1          2   ...         10          11         12
0    3.627579  18.000000  11.163464  ...  18.521192  352.741952   4.980000
1    0.027310   0.000000  11.163464  ...  18.521192  396.900000   9.140000
2    0.027290  10.722951   7.070000  ...  18.521192  352.741952  12.991767
3    3.627579  10.722951  11.163464  ...  18.700000  352.741952  12.991767
4    3.627579   0.000000   2.180000  ...  18.700000  352.741952   5.330000
..        ...        ...        ...  ...        ...         ...        ...
501  3.627579  10.722951  11.163464  ...  21.000000  352.741952   9.670000
502  0.045270   0.000000  11.930000  ...  18.521192  396.900000   9.080000
503  3.627579  10.722951  11.930000  ...  21.000000  352.741952   5.640000
504  0.109590   0.000000  11.930000  ...  21.000000  393.450000   6.480000
505  0.047410   0.000000  11.930000  ...  18.521192  396.900000   7.880000

[506 rows x 13 columns]
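
In the output above every filled cell in a column carries the same value (for example 3.627579 in column 0), which is that column's mean over the observed entries. A minimal sketch of how to confirm this, using a small made-up matrix in place of x_missing (`x_demo` and its values are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Small made-up matrix standing in for x_missing.
x_demo = np.array([
    [1.0, np.nan, 3.0],
    [np.nan, 4.0, 5.0],
    [7.0, 8.0, np.nan],
])

imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
x_demo_mean = imp_mean.fit_transform(x_demo)

# Column means computed by hand, ignoring NaN.
col_means = np.nanmean(x_demo, axis=0)
print(imp_mean.statistics_)  # fill value learned for each column
print(col_means)

# Positions that were missing before imputation.
mask = np.isnan(x_demo)
print(x_demo_mean[mask])
```

Observed entries pass through unchanged; only the masked positions receive the column mean.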

Filling with 0:

imp_0 = SimpleImputer(missing_values=np.nan , strategy="constant" , fill_value=0)
x_missing_0 = imp_0.fit_transform(x_missing)
print(pd.DataFrame(x_missing_0))

          0     1      2    3      4   ...   8      9     10      11    12
0    0.00000  18.0   0.00  0.0  0.538  ...  1.0  296.0   0.0    0.00  4.98
1    0.02731   0.0   0.00  0.0  0.469  ...  2.0    0.0   0.0  396.90  9.14
2    0.02729   0.0   7.07  0.0  0.000  ...  2.0  242.0   0.0    0.00  0.00
3    0.00000   0.0   0.00  0.0  0.458  ...  0.0  222.0  18.7    0.00  0.00
4    0.00000   0.0   2.18  0.0  0.000  ...  0.0    0.0  18.7    0.00  5.33
..       ...   ...    ...  ...    ...  ...  ...    ...   ...     ...   ...
501  0.00000   0.0   0.00  0.0  0.573  ...  1.0    0.0  21.0    0.00  9.67
502  0.04527   0.0  11.93  0.0  0.573  ..
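
For reference, constant filling with 0 should give the same result as pandas' fillna(0). The sketch below checks that on a tiny made-up frame (`x_demo`, `via_sklearn` and `via_pandas` are hypothetical names, and the values are only loosely modeled on the table above).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up frame with a few missing cells.
x_demo = pd.DataFrame([
    [np.nan, 18.0],
    [0.02731, np.nan],
    [0.02729, 0.0],
])

imp_0 = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)
via_sklearn = imp_0.fit_transform(x_demo.to_numpy())
via_pandas = x_demo.fillna(0).to_numpy()

print(np.array_equal(via_sklearn, via_pandas))
```

Zero filling is simple, but after filling, a genuine 0 and a filled 0 can no longer be told apart, which is worth keeping in mind for features where 0 is a real value.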
