NLP Model Notes: One-Hot Encoding [Summary]


Concept

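Name: One-hot encoding
Alias: One-Hot encoding (one-bit-effective encoding)
Description: Uses an N-bit state register to encode N states. Each state has its own register bit, and at any given time exactly one of those bits is set.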
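
To make the definition concrete, here is a minimal sketch that assigns an N-bit code to each of N states. The helper name `one_hot` and the three example states are assumptions made for illustration only; they do not come from the original notes.

```python
def one_hot(index, n):
    """Return a list of n bits in which only position `index` is 1."""
    if not 0 <= index < n:
        raise ValueError("index must be in range(n)")
    bits = [0] * n
    bits[index] = 1
    return bits


# Hypothetical states, used only to illustrate the definition.
states = ['idle', 'running', 'stopped']

# N states need N bits; state i owns bit i.
codes = {state: one_hot(i, len(states)) for i, state in enumerate(states)}

for state, bits in codes.items():
    print(state, bits)
```

Every code has the same length as the number of states, and each code contains exactly one 1, which is the "only one bit is set" property from the description.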

Principle

data = [
    ['tall', 'Chinese', 'men'],
    ['short', 'American', 'women'],
    ['tall', 'Japanese', 'men']
]
sample = ['tall', 'American', 'women']

print("Dataset:")
for d in data:
    print("  %s" % str(d))
print("Sample:")
print("  %s" % str(sample))

print()
print("Solution:")

print("Collect the possible values of each attribute")
height = ['tall', 'short']  # height: tall, short
country = ['Chinese', 'American', 'Japanese']  # nationality: Chinese, American, Japanese
gender = ['men', 'women']  # gender: men, women

features = [height, country, gender]
for feature in features:
    print("  %s" % feature)

print("Sort the values of each attribute")
for index, feature in enumerate(features):
    features[index] = sorted(feature)  # every attribute's value list is sorted
    print("  %s after sorting: %s" % (feature, features[index]))

print("Encode each attribute of the sample using the sorted value lists")
for index, value in enumerate(sample):
    vec = [0] * len(features[index])  # one bit per possible value
    vec[features[index].index(value)] = 1  # set the bit of the observed value
    print("  %s: %s" % (value, vec))

print("Therefore")
encoded_vector = []
for index, feature in enumerate(features):
    vec = [0] * len(feature)
    vec[feature.index(sample[index])] = 1
    encoded_vector += vec  # concatenate the per-attribute codes
print('  encoded vector: %s' % str(encoded_vector))

Output:

Dataset:
  ['tall', 'Chinese', 'men']
  ['short', 'American', 'women']
  ['tall', 'Japanese', 'men']
Sample:
  ['tall', 'American', 'women']

Solution:
Collect the possible values of each attribute
  ['tall', 'short']
  ['Chinese', 'American', 'Japanese']
  ['men', 'women']
Sort the values of each attribute
  ['tall', 'short'] after sorting: ['short', 'tall']
  ['Chinese', 'American', 'Japanese'] after sorting: ['American', 'Chinese', 'Japanese']
  ['men', 'women'] after sorting: ['men', 'women']
Encode each attribute of the sample using the sorted value lists
  tall: [0, 1]
  American: [1, 0, 0]
  women: [0, 1]
Therefore
  encoded vector: [0, 1, 1, 0, 0, 0, 1]
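
Because each attribute occupies a fixed block of the concatenated vector, the encoding can also be reversed. The following is a minimal sketch under that assumption; the function names `encode` and `decode` are hypothetical helpers introduced here, and the sorted value lists are copied from the walkthrough above.

```python
def encode(sample, features):
    """Concatenate one one-hot block per attribute."""
    vector = []
    for value, values in zip(sample, features):
        block = [0] * len(values)
        block[values.index(value)] = 1  # raises ValueError for unseen values
        vector += block
    return vector


def decode(vector, features):
    """Split the vector back into blocks and recover each attribute value."""
    sample = []
    start = 0
    for values in features:
        block = vector[start:start + len(values)]
        if sum(block) != 1:
            raise ValueError("each block must contain exactly one 1")
        sample.append(values[block.index(1)])
        start += len(values)
    return sample


# Sorted value lists, as produced in the walkthrough above.
features = [
    ['short', 'tall'],
    ['American', 'Chinese', 'Japanese'],
    ['men', 'women'],
]

vector = encode(['tall', 'American', 'women'], features)
print(vector)
print(decode(vector, features))
```

Note that this sketch fails with a `ValueError` when the sample contains a value that never appeared in the value lists, so the lists must cover every value that needs to be encoded.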

Implementation

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

data = [
    ['tall', 'Chinese', 'men'],
    ['short', 'American', 'women'],
    ['tall', 'Japanese', 'men']
]

sample = ['tall', 'American', 'women']

# Learn the set of values of each column from the dataset
encoder.fit(data)

# transform() returns a sparse matrix by default, so convert it to a dense array
encoded_vector = encoder.transform([sample]).toarray()[0]

print("Dataset:")
for d in data:
    print("  %s" % str(d))
print("Sample:")
print("  %s" % str(sample))

print("Result:")
print("  encoded vector: %s" % str(encoded_vector))

Output:

Dataset:
  ['tall', 'Chinese', 'men']
  ['short', 'American', 'women']
  ['tall', 'Japanese', 'men']
Sample:
  ['tall', 'American', 'women']
Result:
  encoded vector: [0. 1. 1. 0. 0. 0. 1.]
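
The scikit-learn result matches the manual one because the fitted encoder also keeps a sorted list of values per column. The sketch below inspects that list through the `categories_` attribute, reverses the encoding with `inverse_transform`, and shows one way to handle a value that was not in the training data. The choice of `handle_unknown='ignore'` and the unseen value `'Korean'` are assumptions added for illustration; they are not part of the original example.

```python
from sklearn.preprocessing import OneHotEncoder

data = [
    ['tall', 'Chinese', 'men'],
    ['short', 'American', 'women'],
    ['tall', 'Japanese', 'men'],
]

# handle_unknown='ignore' is an assumption made for this sketch: unseen
# values are encoded as an all-zero block instead of raising an error.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(data)

# categories_ holds one sorted array of values per input column.
categories = [list(column) for column in encoder.categories_]
print(categories)

# Encode a known sample, then map the vector back to the original values.
encoded = encoder.transform([['tall', 'American', 'women']]).toarray()
decoded = encoder.inverse_transform(encoded)
print(list(decoded[0]))

# 'Korean' never appeared during fit, so its three-bit block stays all zeros.
unknown = encoder.transform([['tall', 'Korean', 'men']]).toarray()[0]
print(unknown)
```

With the default setting (`handle_unknown='error'`), the last call would raise an error instead, which may be preferable when unseen values indicate a data problem.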

References

Concept: Baidu Baike
Description: One-Hot Encoding (独热编码)
