Basic Methods for Missing Data Imputation (1): k-Nearest Neighbors (kNN) Imputation

wendy_ya

Last modified 2022-08-29 14:17:26

First published 2022-06-07 16:21:00

Article link: https://blog.csdn.net/didi_ya/article/details/125167621

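This article introduces the use of the k-Nearest Neighbors (kNN) algorithm for filling in missing data. The KNNImputer class estimates each missing value from the values of the n_neighbors nearest neighbors in the training set; when too few neighbors are available, it falls back to the training-set mean. A code example and an API reference are included.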

1. Introduction to kNN

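The KNNImputer class fills in missing values using the k-Nearest Neighbors (kNN) algorithm. Each missing value of a sample is estimated from the values of the n_neighbors nearest neighbors found in the training set. Note that if a sample is missing more than one feature, it may have a different set of n_neighbors donor neighbors for each feature being imputed.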

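Each missing feature is then imputed as the weighted or unweighted average of these neighbors' values. If fewer than n_neighbors donor neighbors are available, the training-set mean of that feature is used instead. The total number of samples in the training set is, of course, always greater than or equal to the number of nearest neighbors available for imputation; the latter depends on the overall sample size and on how many samples are excluded from the nearest-neighbor computation because too many of their features are missing (controlled by row_max_missing).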
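The procedure described above can be sketched by hand with NumPy. This is only an illustrative sketch, not the library's implementation: it assumes the distance is the Euclidean distance over the coordinates observed in both rows, rescaled by the share of coordinates used (the metric scikit-learn calls nan_euclidean), it uses unweighted averaging, and it ignores row_max_missing. The function names nan_euclidean and knn_impute are made up for this example.

```python
import numpy as np

def nan_euclidean(a, b):
    """Euclidean distance over coordinates present in both rows, rescaled
    by (total coordinates / coordinates actually used)."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return np.nan  # no common coordinate: distance is undefined
    sq = np.sum((a[both] - b[both]) ** 2)
    return np.sqrt(a.size / both.sum() * sq)

def knn_impute(X, n_neighbors=2):
    """Fill each NaN with the unweighted mean of its nearest donors (sketch)."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    col_means = np.nanmean(X, axis=0)  # fallback when no usable donor exists
    for i, j in zip(*np.where(np.isnan(X))):
        # Donors are the rows in which feature j is observed.
        donors = np.where(~np.isnan(X[:, j]))[0]
        dists = np.array([nan_euclidean(X[i], X[r]) for r in donors])
        valid = ~np.isnan(dists)
        if not valid.any():
            out[i, j] = col_means[j]
            continue
        donors, dists = donors[valid], dists[valid]
        nearest = donors[np.argsort(dists, kind="stable")[:n_neighbors]]
        out[i, j] = X[nearest, j].mean()
    return out

nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
filled = knn_impute(X, n_neighbors=2)
print(filled)
```

On this small matrix the sketch gives 4 for the missing entry in the first row (the mean of 3 and 5) and 5.5 for the missing entry in the third row (the mean of 3 and 8).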

2. Code Example

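The following snippet shows how to replace missing values, encoded as np.nan, with the mean feature value of the two nearest neighbors of each row that contains missing values: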

>>> import numpy as np
>>> from missingpy import KNNImputer
>>> nan = np.nan
>>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])

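If the code above raises an error (for example, because missingpy is not installed or is incompatible with your scikit-learn version), use scikit-learn's own KNNImputer instead: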

>>> import numpy as np
>>> from sklearn.impute import KNNImputer
>>> nan = np.nan
>>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
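Since neighbors are drawn from the training set, a fitted imputer can also be applied to rows it has not seen. The sketch below assumes scikit-learn's sklearn.impute.KNNImputer; the variable names X_train and X_new and the values in X_new are illustrative only.

```python
import numpy as np
from sklearn.impute import KNNImputer

nan = np.nan
# Training data: the same matrix as in the example above.
X_train = np.array([[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]], dtype=float)
# A new row with its first feature missing (hypothetical values).
X_new = np.array([[nan, 4, 3]], dtype=float)

# fit() stores the training rows; transform() looks up donors among them.
imputer = KNNImputer(n_neighbors=2, weights="uniform").fit(X_train)
X_new_filled = imputer.transform(X_new)
print(X_new_filled)
```

Here the two closest training rows that have the first feature observed are [3, 4, 3] and [1, 2, nan], so the missing entry is filled with their mean, 2.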

3. API Reference

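The KNNImputer API is shown below. The signature and docstring are those of missingpy; sklearn.impute.KNNImputer differs (for example, it defaults to missing_values=np.nan and metric="nan_euclidean", and it has no row_max_missing or col_max_missing parameters):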

KNNImputer(missing_values="NaN", n_neighbors=5, weights="uniform", 
                 metric="masked_euclidean", row_max_missing=0.5, 
                 col_max_missing=0.8, copy=True)
             
Parameters
----------
missing_values : integer or "NaN", optional (default = "NaN")
    The placeholder for the missing values. All occurrences of
    `missing_values` will be imputed. For missing values encoded as
    ``np.nan``, use the string value "NaN".

n_neighbors : int, optional (default = 5)
    Number of neighboring samples to use for imputation.

weights : str or callable, optional (default = "uniform")
    Weight function used in prediction.  Possible values:

    - 'uniform' : uniform weights.  All points in each neighborhood
      are weighted equally.
    - 'distance' : weight points by the inverse of their distance.
      In this case, closer neighbors of a query point will have a
      greater influence than neighbors which are further away.
    - [callable] : a user-defined function which accepts an
      array of distances, and returns an array of the same shape
      containing the weights.

metric : str or callable, optional (default = "masked_euclidean")
    Distance metric for searching neighbors. Possible values:
    - 'masked_euclidean'
    - [callable] : a user-defined function which conforms to the
    definition of _pairwise_callable(X, Y, metric, **kwds). In other
    words, the function accepts two arrays, X and Y, and a
    ``missing_values`` keyword in **kwds and returns a scalar distance
    value.

row_max_missing : float, optional (default = 0.5)
    The maximum fraction of columns (i.e. features) that can be missing
    before the sample is excluded from nearest neighbor imputation. It
    means that such rows will not be considered a potential donor in
    ``fit()``, and in ``transform()`` their missing feature values will be
    imputed to be the column mean for the entire dataset.

col_max_missing : float, optional (default = 0.8)
    The maximum fraction of rows (or samples) that can be missing
    for any feature beyond which an error is raised.

copy : boolean, optional (default = True)
    If True, a copy of X will be created. If False, imputation will
    be done in-place whenever possible. Note that, if metric is
    "masked_euclidean" and copy=False then missing_values in the
    input matrix X will be overwritten with zeros.

Attributes
----------
statistics_ : 1-D array of length {n_features}
    The 1-D array contains the mean of each feature calculated using
    observed (i.e. non-missing) values. This is used for imputing
    missing values in samples that are either excluded from nearest
    neighbors search because they have too many ( > row_max_missing)
    missing features or because all of the sample's k-nearest neighbors
    (i.e., the potential donors) also have the relevant feature value
    missing.

Methods
-------
fit(X, y=None):
    Fit the imputer on X.

    Parameters
    ----------
    X : {array-like}, shape (n_samples, n_features)
        Input data, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    Returns
    -------
    self : object
        Returns self.
        
        
transform(X):
    Impute all missing values in X.

    Parameters
    ----------
    X : {array-like}, shape = [n_samples, n_features]
        The input data to complete.

    Returns
    -------
    X : {array-like}, shape = [n_samples, n_features]
        The imputed dataset.


fit_transform(X, y=None, **fit_params):
    Fit KNNImputer and impute all missing values in X.

    Parameters
    ----------
    X : {array-like}, shape (n_samples, n_features)
        Input data, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    Returns
    -------
    X : {array-like}, shape (n_samples, n_features)
        Returns imputed dataset.
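To make the effect of the weights parameter concrete, the sketch below compares "uniform" and "distance" weighting on the matrix from section 2. It assumes scikit-learn's sklearn.impute.KNNImputer (not missingpy) and its default nan_euclidean metric; the variable names are illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer

nan = np.nan
X = np.array([[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]], dtype=float)

# Unweighted mean of the two nearest donors.
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)
# Inverse-distance weighting: closer donors count more.
weighted = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

print(uniform[0, 2], weighted[0, 2])
print(uniform[2, 0], weighted[2, 0])
```

For the first row, the two donors (values 3 and 5) lie at distances sqrt(12) and sqrt(48), so inverse-distance weights are in the ratio 2:1 and the weighted estimate is 11/3, compared with 4 under uniform weights. For the third row both donors are equally distant, so the two settings agree on 5.5.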

Reference:

Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing value estimation methods for DNA microarrays, BIOINFORMATICS Vol. 17 no. 6, 2001 Pages 520-525.