sklearn LabelEncoder: handling previously unseen values

Latest recommended article published on 2025-04-15 22:41:59

Original article


License: CC 4.0 BY-SA

Tags:

#sklearn #python #machine-learning

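This post describes how to handle labels that appear in the test set but were never seen in the training set. It presents two approaches: first, a custom LabelEncoder subclass that marks new labels as 'Unknown'; second, updating the encoding incrementally, so that each new label is assigned the next code value when it is encountered. Both approaches ensure that new labels are handled, and concrete Python code is given for each.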

Method 1: Mark unseen labels as Unknown

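If a LabelEncoder is fitted on the training set and its transform method is then applied to a test set containing a value that was not present during fitting, transform raises a ValueError reporting previously unseen labels.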
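A minimal sketch of the failure described above. The label values ("cat", "dog", "bird", "fish") and the variable names are made up for illustration and are not from the original post.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: "fish" appears only in the test labels.
train_labels = ["cat", "dog", "bird"]
test_labels = ["cat", "fish"]

encoder = LabelEncoder()
encoder.fit(train_labels)

try:
    encoder.transform(test_labels)
    error_message = None
except ValueError as exc:
    # The stock LabelEncoder rejects labels it did not see during fit.
    error_message = str(exc)

print(error_message)
```

The labels seen during fitting are stored in `encoder.classes_`, so checking test values against that array is a cheap way to detect the problem before calling transform.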

We can subclass LabelEncoder and override fit and transform. Any label that was not seen during fitting is then assigned to the Unknown class.

from sklearn.preprocessing import LabelEncoder as LEncoder

class LabelEncoder(LEncoder):
    """LabelEncoder that maps labels unseen during fit to an 'Unknown' class."""

    def fit(self, y):
        """
        Fit the encoder on all unique values in y, plus an extra
        'Unknown' class reserved for labels not seen during fit.
        :param y: A list of strings
        :return: self
        """
        return super().fit(list(y) + ['Unknown'])

    def transform(self, y):
        """
        Transform y to a list of ids; values that were not seen
        during fit are assigned to the 'Unknown' class.
        :param y: A list of strings
        :return: array-like of shape [n_samples]
        """
        # Build the lookup set once instead of once per element
        known = set(self.classes_)
        new_y = [x if x in known else 'Unknown' for x in y]
        return super().transform(new_y)

Sample usage: