While using the Embedding layer in Keras, luffy ran into the following question:
- Suppose vocabulary_size is 10, but the token ids in the training set do not fully cover the vocabulary; say the training set only uses tokens 6-10:
vocabulary: {1,2,3,4,5,6,7,8,9,10}
train_pairs: [[6, 1], [7, 1], [8, 1], [9, 1], [10, 1]]
The question: will the rows of the embedding layer corresponding to vocab[1-5] ever get trained?
As we know, the essence of a word vector lies in the embedding layer's weights: mapping token i to its word vector amounts to selecting the i-th row of the embedding matrix, as illustrated in the hand-drawn figure below:
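The row-selection idea can be sketched in plain NumPy (a minimal illustration, not from the original post; the matrix values are made up):

```python
import numpy as np

# The embedding layer is just a (vocabulary_size, word_vector_dim) weight
# matrix; looking up token i selects the i-th row of that matrix.
vocabulary_size, word_vector_dim = 10, 3
embedding_matrix = np.arange(vocabulary_size * word_vector_dim,
                             dtype=np.float32).reshape(vocabulary_size,
                                                       word_vector_dim)

token_id = 4
word_vector = embedding_matrix[token_id]  # row 4 of the weight matrix
print(word_vector)  # [12. 13. 14.]
```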
Experiment
import keras
import numpy as np
# Generate training data: token ids 6..10 only
x_train = np.random.randint(6, 10 + 1, size=(100, 1), dtype=np.int32)
y_train = np.random.randn(100, 1)

def build(vocabulary_size, word_vector_dim=10):
    """Build the model; the embedding layer is initialized to all ones."""
    input_user = keras.layers.Input(shape=(None,), dtype="int32")
    user_vector = keras.layers.Embedding(vocabulary_size + 1, word_vector_dim, input_length=1, embeddings_initializer='ones')(input_user)
    user_vector = keras.layers.Reshape((word_vector_dim,))(user_vector)
    outs = keras.layers.Dense(1, activation='sigmoid')(user_vector)
    model = keras.models.Model(inputs=input_user, outputs=outs)
    return model

model = build(10, word_vector_dim=3)
model.compile(optimizer='adam', loss='mse', metrics=['acc'])

# Print the embedding layer weights before training
weights = model.get_weights()
print(f'before training {weights[0]}')

model.fit(x_train, y_train, batch_size=10, epochs=10)

# Print the embedding layer weights after training
weights = model.get_weights()
print(f'after training {weights[0]}')
Output
before training
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
after training
[[1. 1. 1. ]
[1. 1. 1. ]
[1. 1. 1. ]
[1. 1. 1. ]
[1. 1. 1. ]
[1. 1. 1. ]
[1.021993 1.0308231 1.0193614 ]
[0.97705996 0.99987805 0.98890185]
[1.0248165 0.9592147 1.0343803 ]
[0.98026574 0.9917016 0.98594916]
[1.0376464 0.96485394 0.98285866]]
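Rather than eyeballing the printed matrices, the changed rows can be found by diffing the two weight snapshots. A sketch with simulated before/after arrays standing in for the real `model.get_weights()[0]` (the `+ 0.02` update is made up for illustration):

```python
import numpy as np

# Simulated snapshots standing in for model.get_weights()[0]:
before = np.ones((11, 3), dtype=np.float32)
after = before.copy()
after[6:] += 0.02  # pretend only rows 6..10 received gradient updates

# Rows whose values differ between the two snapshots
changed = [i for i in range(after.shape[0])
           if not np.allclose(before[i], after[i])]
print(changed)  # [6, 7, 8, 9, 10]
```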
Conclusion
Comparing the embedding layer weights before and after training, rows 0 through 5 were never updated. So when the training set does not cover the full vocabulary, you need to think about how to handle the untrained rows.
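One simple workaround (a suggestion of mine, not part of the original experiment) is to remap the token ids that actually occur in the training data onto a compact 0..n-1 range, so the Embedding table carries no dead rows:

```python
# Build a compact id mapping from the tokens seen in the training data.
train_tokens = [6, 7, 8, 9, 10]
id_map = {tok: i for i, tok in enumerate(sorted(set(train_tokens)))}  # {6: 0, 7: 1, ...}
remapped = [id_map[t] for t in train_tokens]
print(remapped)  # [0, 1, 2, 3, 4]
# An Embedding layer sized len(id_map) now has every row covered.
```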