Basic Methods for Missing Data Imputation (1): k-Nearest Neighbors (kNN) Imputation

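This article introduces the use of the k-Nearest Neighbors (kNN) algorithm for imputing missing data. The KNNImputer class estimates each missing value from the values of the n_neighbors nearest neighbors in the training set, and falls back to the training-set mean when there are not enough neighbors. A code example and an API reference are provided.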

Installation:

pip install missingpy

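1. Introduction to kNN

The KNNImputer class fills in missing values using the k-Nearest Neighbors (kNN) algorithm. Each missing value of a sample is estimated from the values of the n_neighbors nearest neighbors found in the training set. Note that if a sample is missing more than one feature, it may have a different set of n_neighbors donor neighbors for each feature being imputed.

Each missing feature is then imputed as the weighted or unweighted average of these neighbors. If fewer than n_neighbors donor neighbors are available, the training-set mean of that feature is used instead. The number of nearest neighbors available for imputation can never exceed the total number of samples in the training set; how many are actually available depends on both the overall sample size and the number of samples excluded from the nearest-neighbor computation because they have too many missing features (controlled by row_max_missing).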
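The procedure described above can be sketched in plain Python. This is only an illustrative sketch with uniform weights, not the library's implementation: the helper names `masked_euclidean` and `knn_impute` are hypothetical, and the distance rescaling (number of features divided by number of shared coordinates) is an assumption modeled on how scikit-learn defines its NaN-aware Euclidean distance.

```python
import math

nan = math.nan


def masked_euclidean(a, b):
    """Distance over coordinates observed in both rows, rescaled by
    (number of features / number of shared coordinates)."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)
               if not (math.isnan(x) or math.isnan(y))]
    if not squared:
        return math.inf  # no shared coordinates: treat as infinitely far
    return math.sqrt(len(a) / len(squared) * sum(squared))


def knn_impute(X, n_neighbors=2):
    """Illustrative kNN imputation with uniform weights (not library code)."""
    n_features = len(X[0])
    # Fallback value per feature: mean of its observed values.
    # Assumes every feature has at least one observed value.
    col_means = []
    for j in range(n_features):
        observed = [row[j] for row in X if not math.isnan(row[j])]
        col_means.append(sum(observed) / len(observed))

    result = [list(row) for row in X]
    for i, row in enumerate(X):
        for j in range(n_features):
            if not math.isnan(row[j]):
                continue
            # Donors: rows where feature j is observed. Row i itself is
            # never a donor because its own value for j is missing.
            donors = [(masked_euclidean(row, other), other[j])
                      for other in X if not math.isnan(other[j])]
            donors = [d for d in donors if math.isfinite(d[0])]
            if len(donors) < n_neighbors:
                # Too few donors: fall back to the feature mean.
                result[i][j] = col_means[j]
            else:
                donors.sort(key=lambda d: d[0])
                nearest = donors[:n_neighbors]
                result[i][j] = sum(v for _, v in nearest) / n_neighbors
    return result


X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
imputed = knn_impute(X, n_neighbors=2)
print(imputed)
```

With n_neighbors=2, the missing third feature of the first row becomes 4.0 (the mean of 3 and 5) and the missing first feature of the third row becomes 5.5 (the mean of 3 and 8).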

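2. Code Example

The following snippet shows how to replace missing values, encoded as np.nan, with the mean feature value of the two nearest neighbors of each row that contains missing values: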

>>> import numpy as np
>>> from missingpy import KNNImputer
>>> nan = np.nan
>>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])

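If the code above raises an error (for example, because missingpy cannot be imported in your environment), use the KNNImputer that ships with scikit-learn instead: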

>>> import numpy as np
>>> from sklearn.impute import KNNImputer
>>> nan = np.nan
>>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
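
Both snippets use weights="uniform". To illustrate how weights="distance" would change a single imputed value, here is a minimal plain-Python sketch. The function `inverse_distance_average` is hypothetical, the handling of zero distances is an assumption of this sketch, and the two distances assume a masked Euclidean distance that rescales by the number of features over the number of shared coordinates.

```python
import math


def inverse_distance_average(values, distances):
    """Average donor values, weighting each by 1 / distance."""
    # Assumption of this sketch: donors at distance 0 are exact matches,
    # so only they are averaged (this also avoids division by zero).
    exact = [v for v, d in zip(values, distances) if d == 0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d in distances]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Row [1, 2, nan] from the example: its two nearest donors for the third
# feature hold the values 3 and 5, at distances sqrt(12) and sqrt(48).
uniform = (3 + 5) / 2
weighted = inverse_distance_average([3, 5], [math.sqrt(12), math.sqrt(48)])
print(uniform, round(weighted, 4))
```

The closer donor is twice as near, so it receives twice the weight and pulls the estimate from 4.0 toward 3, giving roughly 3.67.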

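3. API Reference

The API of missingpy's KNNImputer is as follows (scikit-learn's KNNImputer has a different signature; for example, it has no row_max_missing or col_max_missing parameters):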

KNNImputer(missing_values="NaN", n_neighbors=5, weights="uniform",
           metric="masked_euclidean", row_max_missing=0.5,
           col_max_missing=0.8, copy=True)

Parameters
----------
missing_values : integer or "NaN", optional (default = "NaN")
    The placeholder for the missing values. All occurrences of
    `missing_values` will be imputed. For missing values encoded as
    ``np.nan``, use the string value "NaN".

n_neighbors : int, optional (default = 5)
    Number of neighboring samples to use for imputation.

weights : str or callable, optional (default = "uniform")
    Weight function used in prediction.  Possible values:

    - 'uniform' : uniform weights.  All points in each neighborhood
      are weighted equally.
    - 'distance' : weight points by the inverse of their distance.
      In this case, closer neighbors of a query point will have a
      greater influence than neighbors which are further away.
    - [callable] : a user-defined function which accepts an
      array of distances, and returns an array of the same shape
      containing the weights.

metric : str or callable, optional (default = "masked_euclidean")
    Distance metric for searching neighbors. Possible values:

    - 'masked_euclidean'
    - [callable] : a user-defined function which conforms to the
      definition of _pairwise_callable(X, Y, metric, **kwds). In other
      words, the function accepts two arrays, X and Y, and a
      ``missing_values`` keyword in **kwds and returns a scalar distance
      value.

row_max_missing : float, optional (default = 0.5)
    The maximum fraction of columns (i.e. features) that can be missing
    before the sample is excluded from nearest neighbor imputation. It
    means that such rows will not be considered a potential donor in
    ``fit()``, and in ``transform()`` their missing feature values will be
    imputed to be the column mean for the entire dataset.

col_max_missing : float, optional (default = 0.8)
    The maximum fraction of rows (or samples) that can be missing
    for any feature beyond which an error is raised.

copy : boolean, optional (default = True)
    If True, a copy of X will be created. If False, imputation will
    be done in-place whenever possible. Note that, if metric is
    "masked_euclidean" and copy=False then missing_values in the
    input matrix X will be overwritten with zeros.
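
To make the two thresholds row_max_missing and col_max_missing more concrete, the following plain-Python sketch computes the fraction of missing entries per row and per column and compares them with the default values. The helper `missing_fractions` and the sample data are hypothetical; the sketch only mirrors the rule described above and is not the library's code.

```python
import math

nan = math.nan


def missing_fractions(X):
    """Return (per-row, per-column) fractions of missing entries."""
    n_rows, n_cols = len(X), len(X[0])
    row_frac = [sum(math.isnan(v) for v in row) / n_cols for row in X]
    col_frac = [sum(math.isnan(row[j]) for row in X) / n_rows
                for j in range(n_cols)]
    return row_frac, col_frac


X = [[1, nan, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
row_frac, col_frac = missing_fractions(X)

row_max_missing, col_max_missing = 0.5, 0.8
# Rows above the threshold would not serve as donors and would receive
# column means; columns above the threshold would trigger an error.
excluded_rows = [i for i, f in enumerate(row_frac) if f > row_max_missing]
failing_cols = [j for j, f in enumerate(col_frac) if f > col_max_missing]
print(excluded_rows, failing_cols)
```

Here the first row is missing two of its three features (about 0.67 > 0.5), so it would be excluded from the neighbor search, while no column comes close to the 0.8 limit.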

Attributes
----------
statistics_ : 1-D array of length {n_features}
    The 1-D array contains the mean of each feature calculated using
    observed (i.e. non-missing) values. This is used for imputing
    missing values in samples that are either excluded from nearest
    neighbors search because they have too many ( > row_max_missing)
    missing features or because all of the sample's k-nearest neighbors
    (i.e., the potential donors) also have the relevant feature value
    missing.

Methods
-------
fit(X, y=None):
    Fit the imputer on X.

    Parameters
    ----------
    X : {array-like}, shape (n_samples, n_features)
        Input data, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    Returns
    -------
    self : object
        Returns self.
        
        
transform(X):
    Impute all missing values in X.

    Parameters
    ----------
    X : {array-like}, shape = [n_samples, n_features]
        The input data to complete.

    Returns
    -------
    X : {array-like}, shape = [n_samples, n_features]
        The imputed dataset.


fit_transform(X, y=None, **fit_params):
    Fit KNNImputer and impute all missing values in X.

    Parameters
    ----------
    X : {array-like}, shape (n_samples, n_features)
        Input data, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    Returns
    -------
    X : {array-like}, shape (n_samples, n_features)
        Returns imputed dataset.       
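
The difference between fit, transform and fit_transform is easiest to see when the imputer is fitted on one dataset and applied to another. The sketch below assumes scikit-learn and NumPy are installed and uses scikit-learn's KNNImputer, which follows the same fit/transform pattern; the arrays `X_train` and `X_new` are made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

nan = np.nan

# Training data: the donors are taken from these rows.
X_train = np.array([[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]])
# New data to complete, using neighbors found in the training data.
X_new = np.array([[7, nan, 6]])

imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputer.fit(X_train)                     # learn from the training set only
X_new_imputed = imputer.transform(X_new)
print(X_new_imputed)
```

The two training rows closest to the new sample are [nan, 6, 5] and [8, 8, 7], so the missing second feature is filled with the mean of 6 and 8, which is 7.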

References:

  1. Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing value estimation methods for DNA microarrays, BIOINFORMATICS Vol. 17 no. 6, 2001 Pages 520-525.