[TensorFlow] (1) Using tf.feature_column.categorical_column_with_identity()

凝眸伏笔

Article link: https://blog.csdn.net/pearl8899/article/details/107946519

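1. Purpose

Treats integer inputs as category IDs so that they can be one-hot encoded. For example, categories such as [cat, dog, pig] must first be mapped to the integers [0, 1, 2]; this function can then be used to one-hot encode them. It does not process lists of text values directly.

The official documentation describes it as a CategoricalColumn that returns identity values.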

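Inputs:

key: the feature name. features is a dictionary whose keys are feature names and whose values are feature values, and features[key] is either a Tensor or a SparseTensor. In a dense Tensor, missing values can be represented as -1 for integers and '' for strings; this feature column drops them.

num_buckets: the number of buckets (categories). Valid inputs are integers in the range [0, num_buckets).

default_value: use this function when the inputs are integers in the range [0, num_buckets) and the input value itself should serve as the category ID. Values outside this range are mapped to default_value if it is specified; otherwise the call fails. In the example below default_value=0, so a literal 0 in the input ends up with the same ID as the default.

Output:

An N × num_buckets matrix, where N is the number of samples. This is the result once the column is wrapped in indicator_column and passed through input_layer, as in the example in section 2.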
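The mapping described above can be sketched in plain Python, without TensorFlow. This is only an illustration of the semantics, not the library's implementation: identity_one_hot is a hypothetical helper name, and the sketch ignores the special handling of missing values (-1).

```python
def identity_one_hot(values, num_buckets, default_value=None):
    """Hypothetical helper (not a TensorFlow API): one-hot encode integer IDs.

    Each value in [0, num_buckets) is used directly as the index of the 1.
    Out-of-range values fall back to default_value, or raise if it is None.
    """
    rows = []
    for v in values:
        if not 0 <= v < num_buckets:
            if default_value is None:
                raise ValueError("value %d is outside [0, %d)" % (v, num_buckets))
            v = default_value
        row = [0.0] * num_buckets
        row[v] = 1.0
        rows.append(row)
    return rows


# Four samples, five buckets; 7 is out of range and falls back to ID 0.
encoded = identity_one_hot([0, 1, 3, 7], num_buckets=5, default_value=0)
for row in encoded:
    print(row)
```

The result has one row per sample and num_buckets columns, which matches the N × num_buckets shape described above.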

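2. Example

features: a dictionary. The keys of this dict must match the keys of the feature columns, so that each column can later be matched to its data by key.

feature_columns: every column must be a DenseColumn subclass, such as numeric_column, embedding_column, bucketized_column, or indicator_column. A categorical feature must first be wrapped in embedding_column or indicator_column.

That is why the code below calls tf.feature_column.indicator_column(birthplace).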

# -*- coding: utf-8 -*-
import tensorflow as tf  # TensorFlow 1.x API (tf.Session, tf.feature_column.input_layer)

sess = tf.Session()

if __name__ == "__main__":
    # Feature data: one integer category ID per sample
    features = {
        'birthplace': [[0], [1], [1], [3], [4], [5], [6], [7]]
    }
    print("-------1-------")
    print(features)
    # Feature column: the integer value itself is used as the category ID
    birthplace = tf.feature_column.categorical_column_with_identity("birthplace", num_buckets=10, default_value=0)
    print("-------2-------")
    print(birthplace)
    # Wrap the categorical column so that it becomes a dense (one-hot) column
    birthplace = tf.feature_column.indicator_column(birthplace)
    print("-------3-------")
    print(birthplace)
    # Collect the feature columns
    columns = [birthplace]
    # Input layer (data, feature columns)
    inputs = tf.feature_column.input_layer(features, columns)
    print("-------4-------")
    print(inputs)
    # Initialize and run
    init = tf.global_variables_initializer()
    sess.run(tf.tables_initializer())
    sess.run(init)
    v = sess.run(inputs)
    print("-------5-------")
    print(v)

Output:

-------1-------
{'birthplace': [[0], [1], [1], [3], [4], [5], [6], [7]]}
-------2-------
IdentityCategoricalColumn(key='birthplace', number_buckets=10, default_value=0)
-------3-------
IndicatorColumn(categorical_column=IdentityCategoricalColumn(key='birthplace', number_buckets=10, default_value=0))
-------4-------
Tensor("input_layer/concat:0", shape=(8, 10), dtype=float32)
-------5-------
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] #[0]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.] #[1]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.] #[1]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] #[3]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] #[4]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] #[5]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.] #[6]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]] #[7]

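Interpretation:

Here the numerical feature is birthplace=[[0], [1], [1], [3], [4], [5], [6], [7]]

When categorical_column_with_identity is used to one-hot encode this input, there are three cases:

Case 1: num_buckets = max(birthplace) - min(birthplace) + 1. Since min(birthplace) is 0 here, this equals max(birthplace) + 1 = 8. In the one-hot result, each birthplace value is the index of the 1 in its row.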

num_buckets = 8
[[1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1.]]

Case 2: num_buckets > max(birthplace) - min(birthplace) + 1. As the output below shows, the last two columns are all zeros; they take up storage without contributing anything to the category mapping.

num_buckets = 10
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]

Case 3: num_buckets < max(birthplace) - min(birthplace) + 1. Here num_buckets=5, so the first five categories (0 to 4) are one-hot encoded normally, while every larger value gets a 1 at index 0, because the third argument is default_value=0.

num_buckets = 5
[[1. 0. 0. 0. 0.] #[0]
 [0. 1. 0. 0. 0.] #[1]
 [0. 1. 0. 0. 0.] #[1]
 [0. 0. 0. 1. 0.] #[3]
 [0. 0. 0. 0. 1.] #[4]
 [1. 0. 0. 0. 0.] #[5] out of range, mapped to default_value; same one-hot row as input 0 because default_value=0
 [1. 0. 0. 0. 0.] #[6]
 [1. 0. 0. 0. 0.]] #[7]
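A quick way to see which of the three cases applies is to check which inputs fall outside [0, num_buckets) before building the column. The sketch below is plain Python; out_of_range is a hypothetical helper name introduced only for this illustration.

```python
birthplace = [0, 1, 1, 3, 4, 5, 6, 7]


def out_of_range(values, num_buckets):
    """Hypothetical helper: values that would be replaced by default_value."""
    return [v for v in values if not 0 <= v < num_buckets]


# Smallest num_buckets that covers every ID when the IDs start at 0.
smallest_num_buckets = max(birthplace) + 1

for num_buckets in (8, 10, 5):
    replaced = out_of_range(birthplace, num_buckets)
    unused_columns = max(0, num_buckets - smallest_num_buckets)
    print(num_buckets, replaced, unused_columns)
```

With num_buckets=5 the values 5, 6 and 7 are the ones that fall back to default_value, and with num_buckets=10 two columns are never used, which agrees with the outputs shown for the three cases.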

3. Further notes

3.1 tf.Session()

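Purpose: a Session is the object TensorFlow uses to control execution and to return results. Calling session.run() evaluates the operation you ask for, or only the part of the graph that you need. In the example below, product only defines how the result is computed; it does not perform the computation, so a Session is needed to run product and obtain the result. A Session can be used in two ways: created explicitly and called directly, or opened as a context manager in a with block.

Example: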

# -*- coding: utf-8 -*-
import tensorflow as tf

matrix1 = tf.constant([[3, 3]])
matrix2 = tf.constant([[2],
                        [2]])

product = tf.matmul(matrix1, matrix2)
print("-----1-----")
print(product)  # product is still a symbolic tensor here; nothing has been computed yet

sess = tf.Session()
result = sess.run(product)
print("-----2-----")
print(result)  # running product multiplies matrix1 by matrix2 and returns 12

Output:

-----1-----
Tensor("MatMul:0", shape=(1, 1), dtype=int32)
-----2-----
[[12]]
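The other common way to use a Session is as a context manager, which closes the session automatically when the block ends. The sketch below assumes a TensorFlow installation that provides the tensorflow.compat.v1 module (TensorFlow 1.14 or later, including 2.x); on plain TensorFlow 1.x the same code works with import tensorflow as tf and without the disable_eager_execution() call.

```python
import tensorflow.compat.v1 as tf  # assumption: TensorFlow 1.14+ or 2.x

tf.disable_eager_execution()  # use graph mode, as in TensorFlow 1.x

matrix1 = tf.constant([[3, 3]])
matrix2 = tf.constant([[2],
                       [2]])
product = tf.matmul(matrix1, matrix2)  # only defines the computation

# The session is closed automatically when the with block exits.
with tf.Session() as sess:
    result = sess.run(product)

print(result)
```

Compared with calling tf.Session() directly, this form avoids having to remember to call sess.close().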

3.2 tf.global_variables_initializer()

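Purpose: adds a node that initializes the global variables (GraphKeys.GLOBAL_VARIABLES). It returns an op that initializes all global variables.

It initializes every variable in a single step, which is very convenient.

Sample code: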

# -*- coding: utf-8 -*-
import tensorflow as tf

# Defining the variable does not run anything yet
w1 = tf.Variable(tf.random_normal((2, 3), stddev=1, seed=1))
x = tf.constant([[0.1, 0.9]])
y = tf.matmul(x, w1)

print("-----1-----")
print(y)

sess = tf.Session()
sess.run(w1.initializer)  # initialize w1 (this initializes only this one variable)
result = sess.run(y)
print("-----2-----")
print(result)

Output:

-----1-----
Tensor("MatMul:0", shape=(1, 3), dtype=float32)
-----2-----
[[-2.2795656   0.23778343  0.5386348 ]]

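What happens if w1 is not initialized? Comment out sess.run(w1.initializer) and run the code again.

Error: tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value Variable
[[node Variable/read (defined at /PycharmProjects/testPro/temp.py:4) ]]

So variables must be initialized explicitly, here with sess.run(w1.initializer).

What if there are many variables? Initializing them one by one is impractical; a whole screen of code could end up being calls like sess.run(w1.initializer).

TensorFlow provides a more convenient way to initialize variables: calling tf.global_variables_initializer().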
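A minimal sketch of that approach, with two variables initialized by a single op. It assumes a TensorFlow installation that provides tensorflow.compat.v1 (TensorFlow 1.14 or later, including 2.x); the second variable w2 is added here purely for illustration and is not part of the original example.

```python
import tensorflow.compat.v1 as tf  # assumption: TensorFlow 1.14+ or 2.x

tf.disable_eager_execution()  # use graph mode, as in TensorFlow 1.x

w1 = tf.Variable(tf.random_normal((2, 3), stddev=1, seed=1))
w2 = tf.Variable(tf.random_normal((3, 1), stddev=1, seed=1))  # illustrative extra variable
x = tf.constant([[0.1, 0.9]])
y = tf.matmul(tf.matmul(x, w1), w2)

# One op that initializes every global variable (w1 and w2 here).
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)  # replaces sess.run(w1.initializer), sess.run(w2.initializer), ...
    result = sess.run(y)

print(result.shape)
```

The init op has to be created after all the variables have been defined, since it only covers the variables that exist in the graph at that point.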

References:

1. 《TensorFlow实战Google深度学习框架》

2. Official documentation: https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_identity

3. An explanation by an Alibaba engineer: https://zhuanlan.zhihu.com/p/73701872
