import numpy as np

def onehot(labels):
    '''One-hot encode an array of integer class labels.'''
    # number of samples: one output row per label
    n_sample = len(labels)
    # number of classes; labels start at 0, so add 1
    n_class = max(labels) + 1
    # allocate the output array, filled with zeros
    onehot_labels = np.zeros((n_sample, n_class))
    # in each row, set the column of that row's class to 1
    onehot_labels[np.arange(n_sample), labels] = 1
    return onehot_labels
Example runs:

label = np.array([0, 1, 2])
onehot(label)
Out[8]:
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

label = np.array([0, 4, 7, 1, 1, 1, 4, 1])
onehot(label)
Out[10]:
array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])
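As a side note, the same row-and-column assignment can be expressed in one line by indexing rows of an identity matrix: row i of `np.eye(n_class)` is exactly the one-hot vector for class i. A minimal sketch (the function name `onehot_eye` is my own; `np.eye` is standard NumPy):

```python
import numpy as np

def onehot_eye(labels):
    """One-hot encode by selecting rows of an identity matrix."""
    labels = np.asarray(labels)
    # row i of the (n_class x n_class) identity matrix is the
    # one-hot vector for class i, so fancy indexing does the rest
    return np.eye(labels.max() + 1)[labels]

print(onehot_eye([0, 1, 2]))
```

This produces the same arrays as the function above; the explicit `np.zeros` version is easier to read when you are learning the fancy-indexing trick.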
Why use one-hot encoding for discrete (categorical) features? The main reasons are outlined below:
1. Why do we binarize categorical features?
We binarize the categorical input so that it can be thought of as a vector in Euclidean space (we call this embedding the vector in the Euclidean space). One-hot encoding expands the values of a discrete feature into Euclidean space, so that each value of the feature corresponds to a distinct point in that space.
2. Why do we embed the feature vectors in the Euclidean space?
Because many algorithms for classification, regression, clustering, etc. require computing distances between features or similarities between features, and the common definitions of distance and similarity are defined over vectors in Euclidean space. So we would like our features to lie in Euclidean space as well. Mapping discrete features into Euclidean space via one-hot encoding matters precisely because these distance and similarity computations, including cosine similarity, are based on Euclidean space.
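A quick numeric check of this point (a sketch using plain NumPy): if we feed raw integer labels to a distance-based algorithm, class 0 looks "closer" to class 1 than to class 2, which is meaningless for unordered categories. After one-hot encoding, every pair of distinct classes is equidistant:

```python
import numpy as np

onehot = np.eye(3)  # one-hot vectors for classes 0, 1, 2

# distances between the raw integer labels are unequal ...
print(abs(0 - 1), abs(0 - 2))  # 1 vs 2: a spurious ordering

# ... but distances between one-hot vectors are all sqrt(2)
d01 = np.linalg.norm(onehot[0] - onehot[1])
d02 = np.linalg.norm(onehot[0] - onehot[2])
print(d01, d02)
```

Both printed distances equal sqrt(2), so no category is artificially nearer to another.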