Usage notes: this layer can return index encodings, one-hot encodings, or multi-hot encodings; with inverse lookup, it can also map indices back to the original tokens.
Index encoding is the default:
import tensorflow as tf
vocab = ["a", "b", "c", "d"]
data = tf.constant([["a", "c", "d"], ["d", "z", "b"]])
# StringLookup's output_mode defaults to 'int', i.e. it returns the index of each token
layer = tf.keras.layers.StringLookup(vocabulary=vocab)
layer(data)
# Output:
<tf.Tensor: shape=(2, 3), dtype=int64, numpy=
array([[1, 3, 4],
       [4, 0, 2]], dtype=int64)>
layer.get_vocabulary()
# Output:
['[UNK]', 'a', 'b', 'c', 'd']
Setting the number of OOV (out-of-vocabulary) tokens: StringLookup lets you configure how many OOV slots to reserve. Using more than one OOV slot can make the model somewhat more robust, since unknown tokens are spread by hashing across the OOV positions [0, num_oov_indices).
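A minimal sketch of multiple OOV slots using the real num_oov_indices argument (which OOV bucket a given unseen token hashes into is determined by the hash, so no fixed OOV indices are shown):

```python
import tensorflow as tf

vocab = ["a", "b", "c", "d"]
# Reserve two OOV slots: unseen tokens hash into indices 0..1,
# and in-vocabulary tokens shift up to start at index 2.
layer = tf.keras.layers.StringLookup(vocabulary=vocab, num_oov_indices=2)
print(layer.get_vocabulary())
print(layer(tf.constant([["a", "z"], ["x", "d"]])))
```

With num_oov_indices=2 the vocabulary grows to six entries, and the unseen tokens "z" and "x" each land somewhere in [0, 2).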
# Note: when output_mode is set to anything other than 'int', the rank of the output is at most 2
One-hot encoding: since the output rank is capped at 2, the input can be at most 1-D (a scalar or a vector of tokens):
vocab = ["a", "b", "c", "d"]
layer = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')
layer(tf.constant(["a", "b", "c", "d", "z"]))
# Output:
<tf.Tensor: shape=(5, 5), dtype=float32, numpy=
array([[0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0.]], dtype=float32)>
layer(tf.constant('a'))
# Output:
<tf.Tensor: shape=(5,), dtype=float32, numpy=array([0., 1., 0., 0., 0.], dtype=float32)>
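Going the other way, tf.argmax over the last axis recovers the int-mode index from each one-hot row (a small sketch, not a built-in StringLookup feature):

```python
import tensorflow as tf

vocab = ["a", "b", "c", "d"]
layer = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')
one_hot = layer(tf.constant(["a", "b", "z"]))
# The position of the 1 in each row is exactly the 'int' mode index.
indices = tf.argmax(one_hot, axis=-1)
print(indices.numpy())  # [1 2 0]
```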
Multi-hot encoding:
vocab = ["a", "b", "c", "d"]
layer = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='multi_hot')
layer(tf.constant([["a", "c", "d", "d"], ["d", "z", "b", "z"]]))
# Output:
<tf.Tensor: shape=(2, 5), dtype=float32, numpy=
array([[0., 1., 0., 1., 1.],
       [1., 0., 1., 0., 1.]], dtype=float32)>
layer(tf.constant(["a", "c", "d", "d"]))
# Output:
<tf.Tensor: shape=(5,), dtype=float32, numpy=array([0., 1., 0., 1., 1.], dtype=float32)>
Count encoding: builds on multi-hot by also counting how many times each vocabulary position occurs:
vocab = ["a", "b", "c", "d"]
layer = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='count')
layer(tf.constant([["a", "c", "d", "d"], ["d", "z", "b", "z"]]))
# Output:
<tf.Tensor: shape=(2, 5), dtype=float32, numpy=
array([[0., 1., 0., 1., 2.],
       [2., 0., 1., 0., 1.]], dtype=float32)>
layer(tf.constant(["a", "c", "d", "d"]))
# Output:
<tf.Tensor: shape=(5,), dtype=float32, numpy=array([0., 1., 0., 1., 2.], dtype=float32)>
TF-IDF encoding: builds on count encoding by adding tf-idf weighting:
vocab = ["a", "b", "c", "d"]
idf_weights = [0.25, 0.75, 0.6, 0.4]
layer = tf.keras.layers.StringLookup(output_mode="tf_idf")
layer.set_vocabulary(vocab, idf_weights=idf_weights)
layer.get_vocabulary()
['[UNK]', 'a', 'b', 'c', 'd']
layer(tf.constant([["a", "c", "d", "d"], ["d", "z", "b", "z"]]))
# Output: simply each position's count multiplied by the corresponding idf weight (tf * idf)
<tf.Tensor: shape=(2, 5), dtype=float32, numpy=
array([[0. , 0.25, 0. , 0.6 , 0.8 ],
       [1. , 0. , 0.75, 0. , 0.4 ]], dtype=float32)>
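The first row can be checked by hand with NumPy. The [UNK] weight was not set explicitly; judging from the output above it defaults to the mean of the supplied idf_weights (0.5 here):

```python
import numpy as np

# Weights aligned with ['[UNK]', 'a', 'b', 'c', 'd']; 0.5 = mean of idf_weights
idf = np.array([0.5, 0.25, 0.75, 0.6, 0.4])
counts_row1 = np.array([0., 1., 0., 1., 2.])  # counts for ["a", "c", "d", "d"]
print(counts_row1 * idf)  # matches the first row of the tf_idf output
```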
Inverse lookup: map indices back to their strings:
vocab = ["a", "b", "c", "d"]
data = tf.constant([[1, 3, 4], [4, 0, 2]])
layer = tf.keras.layers.StringLookup(vocabulary=vocab, invert=True)
layer(data)
# Output: the corresponding vocabulary strings:
<tf.Tensor: shape=(2, 3), dtype=string, numpy=
array([[b'a', b'c', b'd'],
       [b'd', b'[UNK]', b'b']], dtype=object)>
# Note: Chinese strings are not returned directly; they come back as raw bytes and need an extra encoding/decoding step
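A sketch of that extra decoding step: the inverted layer returns tokens as UTF-8 bytes, and Python's bytes.decode recovers the Chinese text:

```python
import tensorflow as tf

vocab = ["世界", "你", "good", "d"]
layer = tf.keras.layers.StringLookup(vocabulary=vocab, invert=True)
out = layer(tf.constant([1, 2]))  # indices 1 and 2 -> "世界", "你" (0 is [UNK])
# out.numpy() holds raw UTF-8 bytes; decode them back into strings
decoded = [b.decode("utf-8") for b in out.numpy()]
print(decoded)  # ['世界', '你']
```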
Finally, an example of using StringLookup inside a Model. The layer is imported via tf.keras.layers throughout; mixing tf.python.keras imports with tf.keras imports easily leads to package-inconsistency problems:
input3 = tf.keras.layers.Input(shape=(1,), name="string_test", dtype=tf.string)
my_string_lookup = tf.keras.layers.StringLookup(vocabulary=["世界", "你", "good", "d"])(input3)
model = tf.keras.Model(inputs=[input3], outputs=my_string_lookup)
print(model.predict(["世界","你","good", "e"]))
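A common next step is to feed the looked-up indices into an Embedding layer; a minimal sketch (the embedding size 8 is an arbitrary choice for illustration):

```python
import tensorflow as tf

vocab = ["a", "b", "c", "d"]
inputs = tf.keras.layers.Input(shape=(1,), dtype=tf.string)
indices = tf.keras.layers.StringLookup(vocabulary=vocab)(inputs)
# The embedding table needs len(vocab) + 1 rows to cover the default OOV slot.
embedded = tf.keras.layers.Embedding(input_dim=len(vocab) + 1, output_dim=8)(indices)
model = tf.keras.Model(inputs, embedded)
print(model.predict(tf.constant([["a"], ["z"]])).shape)  # (2, 1, 8)
```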