NLP Model Notes: One-Hot Encoding [Summary]
Concept
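Name | One-hot encoding |
---|---|
Alias | One-Hot encoding (one-bit-effective encoding) |
Description | Uses an N-bit state register to encode N states: each state has its own register bit, and at any given time exactly one bit is active. |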
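The definition above can be illustrated with a minimal sketch. The helper name `one_hot` and the traffic-light states are hypothetical, chosen only for illustration: each of the N states gets its own position, and exactly one position is set to 1.

```python
def one_hot(state, states):
    """Return a list with a 1 at the position of `state` and 0 elsewhere."""
    if state not in states:
        raise ValueError("unknown state: %r" % (state,))
    vec = [0] * len(states)        # one position ("register bit") per state
    vec[states.index(state)] = 1   # only the bit of the given state is active
    return vec


# Hypothetical example: N = 3 states, so every code has 3 bits.
states = ["red", "yellow", "green"]
codes = {s: one_hot(s, states) for s in states}
for s, code in codes.items():
    print(s, code)
```

Every code produced this way has the same length as the list of states and sums to 1.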
Principle
Example
data = [
    ['tall', 'Chinese', 'men'],
    ['short', 'American', 'women'],
    ['tall', 'Japanese', 'men']
]
sample = ['tall', 'American', 'women']

print("Dataset:")
for d in data:
    print(" %s" % str(d))
print("Sample:")
print(" %s" % str(sample))
print()

print("Solution:")
print("List the values of each attribute")
# The possible values of each attribute, written out by hand from the dataset.
height = ['tall', 'short']
country = ['Chinese', 'American', 'Japanese']
gender = ['men', 'women']
features = [height, country, gender]
for feature in features:
    print(" %s" % feature)

print("Sort each attribute's values")
# Sorting fixes the position of every value inside its one-hot vector.
for index, feature in enumerate(features):
    features[index] = sorted(feature)
    print(" %s after sorting: %s" % (feature, features[index]))

print("Encode each attribute of the sample using the sorted lists")
for index, feature in enumerate(sample):
    vec = [0] * len(features[index])
    vec[features[index].index(feature)] = 1
    print(" %s: %s" % (feature, vec))

print("Therefore")
# Concatenate the per-attribute vectors into one encoded vector.
encoded_vector = []
for index, feature in enumerate(features):
    vec = [0] * len(feature)
    vec[feature.index(sample[index])] = 1
    encoded_vector += vec
print(' encoded vector: %s' % str(encoded_vector))

Output:
Dataset:
 ['tall', 'Chinese', 'men']
 ['short', 'American', 'women']
 ['tall', 'Japanese', 'men']
Sample:
 ['tall', 'American', 'women']

Solution:
List the values of each attribute
 ['tall', 'short']
 ['Chinese', 'American', 'Japanese']
 ['men', 'women']
Sort each attribute's values
 ['tall', 'short'] after sorting: ['short', 'tall']
 ['Chinese', 'American', 'Japanese'] after sorting: ['American', 'Chinese', 'Japanese']
 ['men', 'women'] after sorting: ['men', 'women']
Encode each attribute of the sample using the sorted lists
 tall: [0, 1]
 American: [1, 0, 0]
 women: [0, 1]
Therefore
 encoded vector: [0, 1, 1, 0, 0, 0, 1]
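The attribute lists in the example are written out by hand. As a sketch, assuming every category of interest appears somewhere in the dataset, they could instead be collected from the data column by column. The function names `collect_categories` and `encode` are hypothetical and not part of the original notes.

```python
def collect_categories(rows):
    """Collect the sorted distinct values of each column."""
    return [sorted(set(column)) for column in zip(*rows)]


def encode(row, categories):
    """Concatenate the one-hot vectors of all values in `row`."""
    encoded = []
    for value, column_categories in zip(row, categories):
        vec = [0] * len(column_categories)
        # list.index raises ValueError if the value never occurred in the data.
        vec[column_categories.index(value)] = 1
        encoded += vec
    return encoded


data = [
    ['tall', 'Chinese', 'men'],
    ['short', 'American', 'women'],
    ['tall', 'Japanese', 'men']
]
sample = ['tall', 'American', 'women']

categories = collect_categories(data)
encoded_vector = encode(sample, categories)
print(categories)
print(encoded_vector)
```

With this dataset the collected lists coincide with the sorted hand-written ones, so the encoded vector is the same as in the example.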
Implementation
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
data = [
    ['tall', 'Chinese', 'men'],
    ['short', 'American', 'women'],
    ['tall', 'Japanese', 'men']
]
sample = ['tall', 'American', 'women']

# Learn the categories of each column from the dataset.
encoder.fit(data)
# transform returns a sparse matrix; convert it to a dense array and take the single row.
encoded_vector = encoder.transform([sample]).toarray()[0]

print("Dataset:")
for d in data:
    print(" %s" % str(d))
print("Sample:")
print(" %s" % str(sample))
print("Result:")
print(" encoded vector: %s" % str(encoded_vector))

Output:
Dataset:
 ['tall', 'Chinese', 'men']
 ['short', 'American', 'women']
 ['tall', 'Japanese', 'men']
Sample:
 ['tall', 'American', 'women']
Result:
 encoded vector: [0. 1. 1. 0. 0. 0. 1.]
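As a complement to the scikit-learn example, the sketch below inspects the categories the encoder learned, decodes a vector back to its labels, and shows one possible way to handle a value that never occurred in the training data. The `handle_unknown="ignore"` setting and the `'French'` value are assumptions added for illustration; the original notes use the default encoder, which raises an error on unseen categories.

```python
from sklearn.preprocessing import OneHotEncoder

data = [
    ['tall', 'Chinese', 'men'],
    ['short', 'American', 'women'],
    ['tall', 'Japanese', 'men']
]

# Assumption: ignore unseen categories (encoded as all zeros) instead of raising.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(data)

# Categories learned for each column, stored in sorted order.
categories = [list(c) for c in encoder.categories_]
print(categories)

# Round trip: encode the sample, then map the vector back to its labels.
sample = ['tall', 'American', 'women']
encoded = encoder.transform([sample]).toarray()
decoded = list(encoder.inverse_transform(encoded)[0])
print(decoded)

# Hypothetical unseen value: 'French' is not in the dataset, so its block stays all zeros.
unseen = ['tall', 'French', 'women']
unseen_vector = encoder.transform([unseen]).toarray()[0]
print(unseen_vector)
```

Whether silently ignoring unknown values is appropriate depends on the task; keeping the default behavior makes such inputs fail loudly instead.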
References
Concept: Baidu Baike
Description: One-Hot Encoding