Table of Contents
1: Research Background
2: Network Architecture
3: Code in Practice
1: Research Background
FM improves the generalization ability of a linear regression model by introducing second-order feature interactions, but it models all feature combinations with the same weight. In practice, many combinations of useless features only introduce noise and hurt performance. Against this background, the paper proposes the Attentional Factorization Machine (AFM), which uses a neural attention network to learn the importance of each feature combination. Like NFM, AFM is a serial FM & DNN structure. At prediction time, FM assigns each feature a single fixed embedding vector, and that same vector is used whenever the feature is crossed with any other feature. This is unreasonable, because different feature interactions differ in importance. The previously introduced FFM model is one way to capture this difference; AFM, which incorporates an attention mechanism, is another: it assigns each feature combination its own weight, and that weight is learnable, reflecting how much attention the model pays to each combination. My understanding is that feature combinations that contribute more to the final classification are given higher weights.
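As a quick reminder, FM's prediction function is

$$\hat{y}_{FM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j,$$

where every second-order cross term enters the sum on equal footing, with no per-interaction importance factor; AFM keeps this structure but multiplies each cross term by a learned attention weight, as formalized in the next section.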
2: Network Architecture
(Figure: AFM network architecture)
Formula:
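Written out, the prediction function of AFM is

$$\hat{y}_{AFM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \mathbf{p}^{T}\sum_{i=1}^{n}\sum_{j=i+1}^{n} a_{ij}\,(\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j,$$

where $\odot$ is the element-wise product and the attention weight $a_{ij}$ of each interaction is produced by a one-layer attention network followed by a softmax:

$$a'_{ij} = \mathbf{h}^{T}\,\mathrm{ReLU}\big(\mathbf{W}\,(\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j + \mathbf{b}\big), \qquad a_{ij} = \frac{\exp(a'_{ij})}{\sum_{(i,j)} \exp(a'_{ij})}.$$

In the code below, `attention_w`, `attention_b`, `attention_h`, and `attention_p` correspond to $\mathbf{W}$, $\mathbf{b}$, $\mathbf{h}$, and $\mathbf{p}$.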
3: Code in Practice
Here only the attention part of the implementation is shown (TensorFlow 1.x); `self.embeddings`, the `weights`/`biases` dictionaries, and the linear term are assumed to be defined elsewhere in the model class:
with tf.name_scope('Pair-wise_Interaction_Layer'):
    # element-wise product of every pair of field embeddings
    pair_wise_product_list = []
    for i in range(self.field_size):
        for j in range(i + 1, self.field_size):
            pair_wise_product_list.append(
                tf.multiply(self.embeddings[:, i, :], self.embeddings[:, j, :]))  # [None, embedding_size]
    self.pair_wise_product = tf.stack(pair_wise_product_list)  # [field_size*(field_size - 1)/2, None, embedding_size]
    self.pair_wise_product = tf.transpose(self.pair_wise_product, perm=[1, 0, 2],
                                          name='pair_wise_product')  # [None, field_size*(field_size - 1)/2, embedding_size]
    self.pair_wise_product = tf.nn.dropout(self.pair_wise_product, self.dropout_keep_fm[1])
with tf.name_scope('attention_net'):
    # attention-network parameters W, b, h, p (notation follows the AFM paper)
    glorot = np.sqrt(2.0 / (self.attention_size + self.embedding_size))
    weights['attention_w'] = tf.Variable(
        np.random.normal(loc=0, scale=glorot, size=(self.embedding_size, self.attention_size)),
        dtype=tf.float32, name='attention_w')
    biases['attention_b'] = tf.Variable(
        np.random.normal(loc=0, scale=glorot, size=(1, self.attention_size)),
        dtype=tf.float32, name='attention_b')
    weights['attention_h'] = tf.Variable(
        np.random.normal(loc=0, scale=1, size=(1, self.attention_size)),
        dtype=tf.float32, name='attention_h')
    weights['attention_p'] = tf.Variable(
        np.random.normal(loc=0, scale=1, size=(self.embedding_size, 1)),
        dtype=tf.float32, name='attention_p')  # if p is all ones, this reduces to the plain FM second-order term
    num_interactions = self.pair_wise_product.shape.as_list()[1]
    # w*x + b
    self.attention_wx_plus_b = tf.add(
        tf.matmul(tf.reshape(self.pair_wise_product, shape=[-1, self.embedding_size]),
                  weights['attention_w']),
        biases['attention_b'])
    self.attention_wx_plus_b = tf.reshape(
        self.attention_wx_plus_b,
        shape=[-1, num_interactions, self.attention_size])  # [None, field_size*(field_size - 1)/2, attention_size]
    # relu(w*x + b)
    self.attention_relu_wx_plus_b = tf.nn.relu(
        self.attention_wx_plus_b)  # [None, field_size*(field_size - 1)/2, attention_size]
    # h*relu(w*x + b)
    self.attention_h_mul_relu_wx_plus_b = tf.multiply(
        self.attention_relu_wx_plus_b,
        weights['attention_h'])  # [None, field_size*(field_size - 1)/2, attention_size]
    # exp(h*relu(w*x + b)), summed over the attention dimension
    self.attention_exp = tf.exp(tf.reduce_sum(
        self.attention_h_mul_relu_wx_plus_b, axis=2,
        keep_dims=True))  # [None, field_size*(field_size - 1)/2, 1]
    # sum(exp(h*relu(w*x + b))) over all interactions
    self.attention_exp_sum = tf.reduce_sum(self.attention_exp, axis=1, keep_dims=True)  # [None, 1, 1]
    # softmax: exp(h*relu(w*x + b)) / sum(exp(h*relu(w*x + b)))
    self.attention_out = tf.div(self.attention_exp, self.attention_exp_sum,
                                name='attention_out')  # [None, field_size*(field_size - 1)/2, 1]
    # attention-weighted sum of the pair-wise interactions
    self.attention_product = tf.multiply(
        self.attention_out,
        self.pair_wise_product)  # [None, field_size*(field_size - 1)/2, embedding_size]
    self.attention_product = tf.reduce_sum(self.attention_product, axis=1)  # [None, embedding_size]
    # p * (attention-weighted pair-wise sum)
    self.attention_net_out = tf.matmul(self.attention_product, weights['attention_p'])  # [None, 1]
    if self.batch_norm:
        self.attention_net_out = self.batch_norm_layer(
            self.attention_net_out, train_phase=self.train_phase, scope_bn='bn1')
with tf.name_scope('out'):
    # y_AFM = w0 + w*x + attention_net(x)
    self.out = tf.add_n([self.w0, self.linear_out, self.attention_net_out])
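To make the tensor shapes above concrete, here is a minimal NumPy sketch (my own addition, not part of the original model; the sizes `batch=2`, `field_size=4`, `embedding_size=8`, `attention_size=16` are arbitrary) that mirrors the pair-wise interaction and attention computation on random data:

import numpy as np

batch, field_size, embedding_size, attention_size = 2, 4, 8, 16
emb = np.random.randn(batch, field_size, embedding_size)

# pair-wise interaction layer: element-wise product of every pair of field embeddings
pairs = np.stack([emb[:, i, :] * emb[:, j, :]
                  for i in range(field_size)
                  for j in range(i + 1, field_size)], axis=1)    # [batch, 6, embedding_size]

# attention-net parameters (random here; learned variables in the graph above)
W = np.random.randn(embedding_size, attention_size)
b = np.random.randn(1, attention_size)
h = np.random.randn(1, attention_size)
p = np.random.randn(embedding_size, 1)

scores = np.maximum(pairs @ W + b, 0) @ h.T                       # h^T * relu(W*x + b): [batch, 6, 1]
a = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)    # softmax over the 6 interactions
attention_net_out = (a * pairs).sum(axis=1) @ p                   # p^T * weighted sum: [batch, 1]
print(attention_net_out.shape)                                    # (2, 1)

With 4 fields there are 4*3/2 = 6 interactions, so the attention weights have shape [batch, 6, 1] and sum to 1 over the 6 pairs; the graph code above computes the same softmax manually with tf.exp / tf.reduce_sum / tf.div.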