Filling in missing values with Python's sklearn library


lordscript

Tags: python, machine learning, numpy, sklearn

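While learning ML with Python, I found that missing values in the data have to be handled first; otherwise the data cannot be fed directly to sklearn for training. Handling them relies on the Imputer class in sklearn.preprocessing. (Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; sklearn.impute.SimpleImputer is its replacement.)
First, note that NumPy arrays can use np.nan / np.NaN (Not a Number) to stand for a missing value, and np.isnan() can be used to check whether an array contains any NaN. (The np.NaN alias was removed in NumPy 2.0; np.nan works everywhere.)
Calling type(np.nan) or type(np.NaN) shows that this value is actually a float: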

import numpy as np
type(np.nan)
<class 'float'>
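As a small illustration of the np.isnan() check mentioned above, here is a minimal sketch; the array `arr` is made-up sample data, not something from the original post.

```python
import numpy as np

arr = np.array([1.0, np.nan, 3.0])  # made-up sample data

# np.isnan returns a boolean mask marking the missing entries
mask = np.isnan(arr)
print(mask)

# NaN never compares equal to itself, so == cannot be used to find it
print(np.nan == np.nan)

has_missing = bool(mask.any())  # True if at least one entry is NaN
n_missing = int(mask.sum())     # number of NaN entries
print(has_missing, n_missing)
```

Because NaN is not equal to itself, a comparison such as `arr == np.nan` finds nothing, which is why np.isnan() is the tool to use here.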

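So, when the dataset to be processed contains missing values, the usual steps are:
1. Replace the missing values in the dataset with the string 'nan';
2. Convert the dataset to floating point, which yields a dataset containing np.nan;
3. Use the sklearn.preprocessing.Imputer class to process the dataset whose missing values are now encoded as np.nan.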
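Steps 1 and 2 can be sketched roughly as follows. The raw array and the "?" placeholder are assumptions made for illustration; a real dataset may mark missing entries differently.

```python
import numpy as np

# Hypothetical raw data read as strings, with "?" marking a missing entry
raw = np.array([["1", "2"], ["?", "3"], ["7", "6"]])

# Step 1: replace the placeholder with the string "nan"
as_text = np.where(raw == "?", "nan", raw)

# Step 2: convert to float; the string "nan" parses to np.nan
data = as_text.astype(float)
print(data)
```

After this conversion the array is ready for step 3, since the missing entry is now an ordinary np.nan inside a float array.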

import numpy as np
from sklearn.preprocessing import Imputer  # only available in scikit-learn < 0.22

# Column-wise mean imputation; the string "NaN" stands for np.nan
imp = Imputer(missing_values="NaN", strategy="mean", axis=0)
X = np.array([[1, 2], [np.nan, 3], [7, 6]])
Y = [[np.nan, 2], [6, np.nan], [7, 6]]
imp.fit(X)

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

imp.transform(Y)

array([[ 4.        ,  2.        ],
       [ 6.        ,  3.66666667],
       [ 7.        ,  6.        ]])

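The code above uses array X to "train" an Imputer, then uses the fitted object to handle the missing values in array Y. Each missing value in Y is replaced with the mean of the corresponding column of X (axis=0 means column-wise).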
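To make the arithmetic concrete, the same result can be reproduced with plain NumPy. This is only a sketch of the computation for the mean strategy with axis=0, not how Imputer is implemented internally.

```python
import numpy as np

X = np.array([[1, 2], [np.nan, 3], [7, 6]])
Y = np.array([[np.nan, 2], [6, np.nan], [7, 6]])

# "fit": per-column means of X, ignoring NaN -> (1+7)/2 and (2+3+6)/3
col_means = np.nanmean(X, axis=0)

# "transform": wherever Y is NaN, take the matching column mean of X
filled = np.where(np.isnan(Y), col_means, Y)
print(col_means)
print(filled)
```

The column means are 4 and about 3.667, which are exactly the values that appear in the transformed Y.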
Of course, the imp object can also be used to process array X itself.
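On current scikit-learn releases, where Imputer no longer exists, filling X with its own column means might look like the sketch below. It uses sklearn.impute.SimpleImputer, which is a different class from the one discussed in this post: it always imputes column-wise and has no axis parameter.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # available in scikit-learn >= 0.20

X = np.array([[1, 2], [np.nan, 3], [7, 6]])

# Missing values are given as np.nan itself rather than the string "NaN"
imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform learns the column means from X and fills X in one call
X_filled = imp.fit_transform(X)

print(imp.statistics_)  # per-column fill values learned from X
print(X_filled)
```

Here the only missing entry in X is in the first column, so it is filled with that column's mean.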
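I did not fully understand this, so I went looking for more documentation on Imputer.
In the code, imp is an instance of the Imputer class. Its parameters are described as follows: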

Parameters:

missing_values : integer or "NaN", optional (default="NaN")

The placeholder for the missing values. All occurrences of
missing_values will be imputed. For missing values encoded as np.nan,
use the string value "NaN".

strategy : string, optional (default="mean")

The imputation strategy.
If "mean", then replace missing values using the mean along the axis.
If "median", then replace missing values using the median along the axis.
If "most_frequent", then replace missing using the most frequent value along the axis.

axis : integer, optional (default=0)

The axis along which to impute.
If axis=0, then impute along columns.
If axis=1, then impute along rows.

verbose : integer, optional (default=0)

Controls the verbosity of the imputer.

copy : boolean, optional (default=True)

If True, a copy of X will be created.
If False, imputation will be done in-place whenever possible.
Note that, in the following cases, a new copy will always be made, even if copy=False:
If X is not an array of floating values;
If X is sparse and missing_values=0;
If axis=0 and X is encoded as a CSR matrix;
If axis=1 and X is encoded as a CSC matrix.

Attributes:

statistics_ : array of shape (n_features,)

The imputation fill value for each feature if axis == 0.

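In short:
1. The axis parameter determines what the statistic is computed over: axis=0 computes it over each column of matrix X, axis=1 over each row.
2. The strategy parameter determines how it is computed: mean takes the average, median takes the median, and most_frequent takes the most frequent value.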
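The two points above can be tried out with plain NumPy. This is a sketch of the underlying statistics only, using made-up data in `col`; it does not call Imputer.

```python
import numpy as np

X = np.array([[1, 2], [np.nan, 3], [7, 6]])

# axis=0: one statistic per column; axis=1: one statistic per row
per_column = np.nanmean(X, axis=0)
per_row = np.nanmean(X, axis=1)
print(per_column, per_row)

# The three strategies on a single made-up column
col = np.array([1.0, 2.0, 2.0, np.nan, 7.0])
observed = col[~np.isnan(col)]          # drop the missing entry first

mean_fill = observed.mean()             # "mean"
median_fill = np.median(observed)       # "median"
values, counts = np.unique(observed, return_counts=True)
most_frequent_fill = values[np.argmax(counts)]  # "most_frequent"
print(mean_fill, median_fill, most_frequent_fill)
```

For this column the mean is 3.0 while the median and the most frequent value are both 2.0, so the choice of strategy changes the fill value.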

When axis=0, columns which only contained missing values at fit are discarded upon transform.
When axis=1, an exception is raised if there are rows for which it is not possible to fill in the missing values (e.g., because they only contain missing values).

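In other words:
With axis=0, a column that contained only missing values at fit time is dropped at transform time.
With axis=1, an error is raised if a row contains only missing values.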
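The axis=0 case can be imitated in NumPy to see what "dropped" means for the output shape. This sketch only mimics the documented behaviour on made-up arrays; it is not taken from the Imputer source.

```python
import numpy as np

# Made-up data: the first column of X_fit contains only missing values
X_fit = np.array([[np.nan, 1.0], [np.nan, 3.0]])
Y_new = np.array([[5.0, np.nan], [6.0, 4.0]])

# Columns with at least one observed value at fit time
valid = ~np.isnan(X_fit).all(axis=0)

# Means are computed only for the columns that can be imputed
means = np.nanmean(X_fit[:, valid], axis=0)

# At transform time the all-missing column is dropped, then NaN is filled
Y_kept = Y_new[:, valid]
filled = np.where(np.isnan(Y_kept), means, Y_kept)
print(filled.shape)
print(filled)
```

Note that the output has one column fewer than the input, even though that column held real values in `Y_new`.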

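Next come the fit and transform methods. fit trains on X and returns the imputer object. One thing about fit struck me as odd.
In my own tests I found that
with axis=0, the fill values are computed from the columns of X,
but with axis=1, they are computed from the rows of Y.
The documentation does not seem to mention this; it only mentions the error case described above.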
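One plausible reading of this observation, which has not been checked against the Imputer source here, is that a row is a single sample, so the rows of X and the rows of Y have nothing in common and a row statistic can only come from the array being transformed. Under that assumption, the axis=1 result for Y would match this NumPy sketch:

```python
import numpy as np

Y = np.array([[np.nan, 2], [6, np.nan], [7, 6]])

# Row-wise means of Y itself, ignoring NaN
row_means = np.nanmean(Y, axis=1)

# Fill each NaN with the mean of its own row;
# [:, None] turns the means into a column so they broadcast across each row
filled = np.where(np.isnan(Y), row_means[:, None], Y)
print(row_means)
print(filled)
```

Comparing this output with imp.transform(Y) on an old scikit-learn install, using an Imputer created with axis=1, would be one way to confirm or refute the observation.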

Can anyone help verify this and discuss it?

1 https://www.cnblogs.com/chaosimple/p/4153158.html
2 http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
3 https://blog.csdn.net/Dream_angel_Z/article/details/49406573