The original TensorFlow implementation handles masking by turning each masked position into a very large negative number and adding that to the raw score vector:
adder = (1.0 - tf.cast(mask, inputs.dtype)) * (
    _large_compatible_negative(inputs.dtype))

# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
inputs += adder
if isinstance(self.axis, (tuple, list)):
    if len(self.axis) > 1:
        return tf.exp(inputs - tf.reduce_logsumexp(
            inputs, axis=self.axis, keepdims=True))
    else:
        return backend.softmax(inputs, axis=self.axis[0])
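Outside the layer the same trick takes only a few lines; here is a minimal self-contained sketch with toy tensors, using -1e9 to stand in for _large_compatible_negative:

import tensorflow as tf

scores = tf.constant([[4.0, -1.0, -1.0]])  # raw attention scores
mask = tf.constant([[1.0, 0.0, 0.0]])      # 1 = keep, 0 = mask out

# Masked positions get a huge negative offset before the softmax,
# so their exponentials all but vanish in the normalization.
adder = (1.0 - mask) * -1e9
probs = tf.nn.softmax(scores + adder, axis=-1)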
In a quick test, however, the masked positions still take part in the softmax computation. For example:
import tensorflow as tf
from keras import backend as K

key = tf.convert_to_tensor([[1, 2, 3], [4, -1, -1], [3, 1, -1]], dtype=K.floatx())  # shape (3, 3)
mask = tf.cast(tf.not_equal(key, -1), K.floatx())  # 0 at the -1 (padding) positions
The softmax output computed this way is:
[[0.09003057 0.24472848 0.66524094]
[0.9867033 0.00664835 0.00664835]
[0.8668133 0.11731042 0.01587624]]
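Note that the -1 entries are still inside the normalization: in the second row, each masked position receives exp(-1) / (exp(4) + 2 * exp(-1)) ≈ 0.00665 of the weight.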
This doesn't feel quite right, so I wrote a slightly more thorough masked softmax that drives the weight at each masked position down to essentially zero:
from keras import backend as K

def mask_softmax(x, mask, axis=-1):
    # Subtract the row max for numerical stability, as in a standard softmax.
    x_max = K.max(x, axis=axis, keepdims=True)
    e = K.exp(x - x_max)
    # Zero out the exponentials at masked positions.
    e_mask = e * mask
    # Normalize over the unmasked positions only; epsilon guards against all-masked rows.
    masked_sums = K.sum(e_mask, axis=axis, keepdims=True) + K.epsilon()
    return e_mask / masked_sums
my_softmax = mask_softmax(key, mask, axis=-1)
The corresponding output:
[[0.09003057 0.24472846 0.6652409 ]
[0.9999999 0. 0. ]
[0.88079697 0.1192029 0. ]]
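As a quick sanity check (a sketch reusing key, mask, and my_softmax from above): the masked weights are exactly zero, yet every row still normalizes to one over the unmasked positions.

row_sums = K.sum(my_softmax, axis=-1)        # ~[1., 1., 1.]
leaked = K.sum(my_softmax * (1.0 - mask))    # 0. -- no weight leaks to masked slots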