Handling previously unseen values with sklearn.preprocessing.LabelEncoder

https://blog.csdn.net/qq_19446965/article/details/120110169

Method 1: Label unseen values as Unknown

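If a LabelEncoder is fitted on the training set and then used to transform the test set, transform raises a ValueError as soon as it meets a value that was not present during fitting.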
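The failure can be reproduced with a minimal sketch. The country names below are illustrative only, and the exact wording of the error message may differ between scikit-learn versions:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['Canada', 'France', 'US'])  # stands in for the training labels

try:
    # 'India' was never seen by fit, so transform is expected to fail
    encoder.transform(['Canada', 'India'])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Catching the exception like this only shows the problem; the two methods below avoid it.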

We can subclass LabelEncoder and override fit and transform so that any new label is assigned to the Unknown class.

from sklearn.preprocessing import LabelEncoder as LEncoder


class LabelEncoder(LEncoder):
    """LabelEncoder that maps values not seen during fit to 'Unknown'."""

    def fit(self, y):
        """
        Fit the encoder on all unique values in y
        plus one extra 'Unknown' class.
        :param y: A list of strings
        :return: self
        """
        return super().fit(list(y) + ['Unknown'])

    def transform(self, y):
        """
        Transform y to a list of ids; values not seen during fit
        are assigned to the 'Unknown' class.
        :param y: A list of strings
        :return: array-like of shape [n_samples]
        """
        known = set(self.classes_)  # build the lookup set once, not once per element
        new_y = [x if x in known else 'Unknown' for x in y]
        return super().transform(new_y)
 

Example usage:

# 'Argentina, US' is a single string, so it is treated as one label of its own
country_list = ['Argentina', 'Australia', 'Canada', 'France', 'Italy', 'Spain', 'US', 'Canada', 'Argentina, US']

label_encoder = LabelEncoder()

label_encoder.fit(country_list)
print('country_list: ', label_encoder.classes_)  # the classes now include 'Unknown'
print('encode_country_list: ', label_encoder.transform(country_list))

new_country_list = ['Canada', 'France', 'Italy', 'Spain', 'US', 'India', 'Pakistan', 'South Africa']
print('new_encode_country_list: ', label_encoder.transform(new_country_list))

Output:

country_list:  ['Argentina' 'Argentina, US' 'Australia' 'Canada' 'France' 'Italy' 'Spain' 'US' 'Unknown']
encode_country_list:  [0 2 3 4 5 6 7 3 1]
new_encode_country_list:  [3 4 5 6 7 8 8 8]

The three unseen values (India, Pakistan, South Africa) are all encoded as 8, the id of the Unknown class.
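One consequence of this approach is that the mapping is lossy: decoding with inverse_transform returns 'Unknown' rather than the original value. The sketch below illustrates this under the same idea as above; the class name UnknownLabelEncoder and the sample labels are assumptions made for this example, chosen to keep the block self-contained and to avoid shadowing sklearn's own class.

```python
from sklearn.preprocessing import LabelEncoder as LEncoder


class UnknownLabelEncoder(LEncoder):
    """Hypothetical name; same idea as Method 1 (unseen values -> 'Unknown')."""

    def fit(self, y):
        return super().fit(list(y) + ['Unknown'])

    def transform(self, y):
        known = set(self.classes_)
        return super().transform([x if x in known else 'Unknown' for x in y])


encoder = UnknownLabelEncoder().fit(['Canada', 'France', 'US'])

# 'India' and 'Pakistan' were not seen during fit
codes = encoder.transform(['France', 'India', 'Pakistan'])
decoded = list(encoder.inverse_transform(codes))

print(codes)
print(decoded)  # the two unseen values cannot be recovered
```

Note also that a genuine label spelled exactly 'Unknown' would be indistinguishable from the placeholder, so a different sentinel may be needed for such data.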

Method 2: Update the encoding

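Another approach is to update the encoding whenever new values appear:

  1. Maintain a collection, features, that stores each distinct value once;
  2. When new values appear, append them to features and refit the encoder.

from sklearn.preprocessing import LabelEncoder as LEncoder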
 
class LabelEncoder(LEncoder):
    def __init__(self):
        """
        Differs from sklearn's LabelEncoder by accepting new classes
        and assigning them the next unused ids.
        """
        self.features = tuple()

    def fit(self, y):
        """
        Record the unique values of y as features and fit the
        underlying encoder on their integer indices.
        :param y: A sequence of strings
        :return: self
        """
        # set() order is arbitrary, so the ids of these values can vary between runs
        self.set_features(tuple(set(y)))
        return super().fit(self.encode_seqs(y))

    def add_features(self, new_features):
        """
        Append features
        @param new_features: A tuple of new features (strings)
        """
        self.features = self.features + new_features

    def set_features(self, new_features):
        """
        Replace features
        @param new_features: A tuple of features (strings)
        """
        self.features = new_features

    def transform(self, y):
        """
        Transform y to a list of ids. Values not seen before are appended
        to features and get new ids; the encoder is then refitted.
        :param y: A list of strings
        :return: array-like of shape [n_samples]
        """
        known = set(self.features)
        # dict.fromkeys drops repeated new values while keeping first-seen order
        increase_features = tuple(dict.fromkeys(x for x in y if x not in known))
        if increase_features:
            self.add_features(increase_features)
            super().fit(self.encode_seqs(self.features))
            print(f"new classes_: {self.classes_}")

        return super().transform(self.encode_seqs(y))

    def encode_seqs(self, seqs):
        """
        Encode each value as its index in features
        @param seqs: A list of strings
        @return: list of ints of length n_seqs
        """
        return [self.features.index(x) for x in seqs]

Output (the ids of the values seen in fit depend on set ordering and may differ between runs):

country_list:  [0 1 2 3 4 5 6 7]
encode_country_list:  [1 0 7 6 5 4 3 7 2]
new classes_: [ 0  1  2  3  4  5  6  7  8  9 10]
new_encode_country_list:  [ 7  6  5  4  3  8  9 10]


The three new values are encoded as 8, 9 and 10.
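The core of this method, giving each new value the next unused id while leaving existing ids unchanged, can also be sketched without scikit-learn. The class IncrementalEncoder below is a hypothetical helper written for illustration, not part of any library, and it assigns ids in order of first appearance rather than in set order:

```python
class IncrementalEncoder:
    """Hypothetical helper: assigns the next free integer id to each new value."""

    def __init__(self):
        self.ids = {}  # value -> integer id, in order of first appearance

    def transform(self, values):
        codes = []
        for value in values:
            if value not in self.ids:
                # A new value gets the next unused id; existing ids never change
                self.ids[value] = len(self.ids)
            codes.append(self.ids[value])
        return codes


encoder = IncrementalEncoder()
train_codes = encoder.transform(['Canada', 'France', 'US', 'Canada'])
test_codes = encoder.transform(['US', 'India', 'Pakistan', 'India'])

print(train_codes)  # [0, 1, 2, 0]
print(test_codes)   # [2, 3, 4, 3]
```

With either version, a model trained on the original ids has never seen the new ones, so growing the encoding is mainly useful when the downstream step can tolerate or be retrained on the extra classes.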
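----------------
Copyright notice: this is an original article by CSDN blogger "Rnan-prince", published under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/qq_19446965/article/details/120110169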
