Multi-hot encoding for categorical variables: encoding a categorical variable as a multi-hot embedding

Latest recommended article published on 2024-03-07 14:58:21

meta life

Category: Short Code. Tags: python, machine learning

Article link: https://blog.csdn.net/pkuhyd/article/details/117920928

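This article shows how to convert categorical or string features, such as values separated by commas or vertical bars, into a multi-hot encoding. This conversion is common in machine learning, where it turns non-numeric features into a form that can be fed to a model.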

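Requirement: convert categorical or string features into a multi-hot encoding, where each feature value holds several categories separated by a comma, a vertical bar, or a similar delimiter.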
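As a quick cross-check of what the target output looks like, here is a minimal sketch that uses pandas' built-in `Series.str.get_dummies` rather than a custom encoder. The sample data and the variable names `colors` and `multi_hot` are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical sample data: each value holds zero or more categories joined by '|'.
colors = pd.Series(['red|green', 'green', None, 'red|yellow'])

# Series.str.get_dummies splits every value on `sep` and returns one 0/1
# indicator column per distinct category; a missing value becomes an all-zero row.
multi_hot = colors.str.get_dummies(sep='|')

print(list(multi_hot.columns))
print(multi_hot.astype(int).to_numpy().tolist())
```

Note that `get_dummies` orders its columns alphabetically and offers no separate fit step, so it cannot reuse a fixed vocabulary on new data. A dedicated encoder with `fit` and `transform` methods covers that case.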

import numpy as np
import pandas as pd
from scipy import sparse


class MultiHotEncoder:
    """
    Encode categorical features as a multi-hot numeric array.

    Parameters
    ----------
    sep : str, default='|'
        Separator between the categories inside a single feature value.

    Attributes
    ----------
    categories_ : dict
        Maps each category seen during fit to its column index.

    Examples
    --------

    >>> enc = MultiHotEncoder()
    >>> X = ['red|green', 'green', 'red|yellow']
    >>> enc.fit(X)
    >>> enc.categories_
    {'red': 0, 'green': 1, 'yellow': 2}
    >>> enc.transform(['green', 'yellow|red', None, 'red|green|yellow'])
    array([[0., 1., 0.],
           [1., 0., 1.],
           [0., 0., 0.],
           [1., 1., 1.]])
    """

    def __init__(self, sep='|'):