Attentional Factorization Machines [Paper Notes]

1 Abstract

  • FM can be hindered by its modelling of all feature interactions with the same weight, as not all feature interactions are equally useful and predictive. For example, the interactions with useless features may even introduce noise and adversely degrade the performance.

  • We improve FM by discriminating the importance of different feature interactions. We propose a novel model named Attentional Factorization Machine (AFM), which learns the importance of each feature interaction from data via a neural attention network.

2 Introduction

  • To leverage the interactions between features, one common solution is to explicitly augment a feature vector with products of features (aka cross features), as in polynomial regression (PR).

  • The key problem with PR is that for sparse datasets, where only a few cross features are observed, the parameters for unobserved cross features cannot be estimated (see the sketch at the end of this section).

  • By learning an embedding vector for each feature, FM can estimate the weight for any cross feature.

  • We devise a novel model named AFM, which utilizes the recent advance in neural network modelling: the attention mechanism.

  • It not only leads to better performance, but also provides insight into which feature interactions contribute more to the prediction.
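As a concrete illustration of the cross-feature idea and its sparsity problem, here is a minimal sketch (my own illustration, not code from the paper; `augment_with_cross_features` is a hypothetical helper):

```python
import numpy as np

def augment_with_cross_features(x):
    """Polynomial-regression-style augmentation: append every pairwise product
    x_i * x_j (i < j), each of which gets its own independent weight in PR."""
    n = len(x)
    crosses = [x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate([x, np.asarray(crosses)])

# For sparse one-hot inputs, a cross feature whose two features never co-occur
# in training is always 0, so its weight receives no gradient and cannot be
# estimated; FM sidesteps this by factorizing the cross weight (Eq. (2) below).
x = np.array([1.0, 0.0, 1.0, 0.0])
print(augment_with_cross_features(x))   # 4 original features + 6 cross features
```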

3 Factorization Machines

  • Given a real-valued feature vector $\mathbf{x} \in \mathbb{R}^n$, where $n$ denotes the number of features, FM estimates the target by modelling all interactions between each pair of features:
    $$\hat{y}_{FM}(\mathbf{x}) = \underbrace{w_0 + \sum_{i=1}^{n} w_i x_i}_{\text{linear regression}} + \underbrace{\sum_{i=1}^{n} \sum_{j=i+1}^{n} \hat{w}_{ij}\, x_i x_j}_{\text{pair-wise feature interactions}} \tag{1}$$
    where $w_0$ is the global bias, $w_i$ is the weight of the $i$-th feature, and $\hat{w}_{ij}$ is the weight of the cross feature $x_i x_j$, factorized as
    $$\hat{w}_{ij} = \mathbf{v}_i^T \mathbf{v}_j \tag{2}$$
    where $\mathbf{v}_i \in \mathbb{R}^k$ is the embedding vector of feature $i$ and $k$ is the embedding size. Note that because of the factor $x_i x_j$, only interactions between non-zero features contribute (a minimal sketch of Eqs. (1)-(2) follows at the end of this section).

  • It is worth noting that FM models all feature interactions in the same way:

    • a latent vector $\mathbf{v}_i$ is shared in estimating all feature interactions that the $i$-th feature involves;
    • all estimated feature interactions $\hat{w}_{ij}$ have a uniform weight of 1.
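
As referenced above, here is a minimal NumPy sketch of the FM prediction in Eqs. (1)-(2), written in the naive O(n²) form for clarity; the parameter values in the usage example are random placeholders, not trained weights.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction (Eq. (1)) with the factorized cross weight of Eq. (2).

    x : (n,) feature vector
    w0: scalar global bias
    w : (n,) linear weights
    V : (n, k) embedding matrix; row i is the embedding v_i of feature i
    """
    linear = w0 + w @ x
    n = len(x)
    # pair-wise interactions: w_hat_ij = v_i . v_j, weighted by x_i * x_j
    pairwise = sum(
        (V[i] @ V[j]) * x[i] * x[j]
        for i in range(n) for j in range(i + 1, n)
    )
    return linear + pairwise

# toy usage: only the non-zero entries of x contribute to the pairwise term
rng = np.random.default_rng(0)
n, k = 6, 4
x = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
print(fm_predict(x, 0.1, rng.normal(size=n), rng.normal(size=(n, k))))
```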

4 Attentional Factorization Machines

  • We omit the linear regression part in the figure, which can be trivially incorporated. Below we detail the pair-wise interaction layer and the attention-based pooling layer, which are the main contributions of this paper.
    (Figure: the neural network architecture of the AFM model.)

4.1 Pair-wise Interaction Layer

  • It expands m vectors to m(m − 1)/2 interacted vectors (where m is the number of non-zero features), and each interacted vector is the element-wise product of two distinct vectors to encode their interaction. We can then represent the output of the pair-wise interaction layer as a set of vectors:
    $$f_{PI}(\mathcal{E}) = \left\{ (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j \right\}_{(i,j) \in \mathcal{R}_x} \tag{3}$$
    where $\odot$ denotes the element-wise product of two vectors, $\mathcal{E}$ is the set of embedding vectors of the non-zero features, and $\mathcal{R}_x$ is the set of index pairs $(i, j)$ with $i < j$ over the non-zero features of $\mathbf{x}$.

    In short, the embedding vectors of all non-zero features are crossed pairwise.
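
A minimal sketch of this layer (my own illustration; the function name is hypothetical):

```python
import numpy as np

def pairwise_interaction_layer(V, x):
    """Eq. (3): the set {(v_i ⊙ v_j) x_i x_j} over all pairs of non-zero
    features, returned as an (m(m-1)/2, k) array of interacted vectors."""
    nz = [i for i in range(len(x)) if x[i] != 0.0]      # indices of non-zero features
    interactions = [
        (V[i] * V[j]) * (x[i] * x[j])                   # element-wise product, scaled
        for a, i in enumerate(nz) for j in nz[a + 1:]
    ]
    return np.stack(interactions)

# toy usage: 3 non-zero features out of 5 -> 3 interacted vectors of size k = 4
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 4))
x = np.array([1.0, 0.0, 2.0, 0.0, 1.0])
print(pairwise_interaction_layer(V, x).shape)           # (3, 4)
```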

4.2 Attention-based Pooling Layer

  • The idea of the attention mechanism is to allow different parts to contribute differently when compressing them into a single representation.

  • We propose to employ the attention mechanism on feature interactions by performing a weighted sum on the interacted vectors:
    $$f_{Att}\left(f_{PI}(\mathcal{E})\right) = \sum_{(i,j) \in \mathcal{R}_x} a_{ij}\, (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j \tag{4}$$
    where $a_{ij}$ is the attention score of the feature interaction $\hat{w}_{ij}$, i.e., its learned weight.

  • The attention network is defined as:
    $$\begin{aligned} a'_{ij} &= \mathbf{h}^T \mathrm{ReLU}\left(\mathbf{W}\,(\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j + \mathbf{b}\right) \\ a_{ij} &= \frac{\exp(a'_{ij})}{\sum_{(i,j) \in \mathcal{R}_x} \exp(a'_{ij})} \end{aligned} \tag{5}$$
    where $\mathbf{W} \in \mathbb{R}^{t \times k}$, $\mathbf{b} \in \mathbb{R}^t$, and $\mathbf{h} \in \mathbb{R}^t$ are model parameters, and $t$ (the attention factor) is the hidden layer size of the attention network.

  • To summarize, we give the overall formulation of the AFM model as:
    $$\hat{y}_{AFM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \mathbf{p}^T \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_{ij}\, (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j \tag{6}$$
    where $\mathbf{p} \in \mathbb{R}^k$ projects the pooled interaction vector to the prediction score. In essence, AFM combines a linear model with an FM whose pair-wise interactions are re-weighted by attention.
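
Putting Eqs. (3)-(6) together, below is a minimal NumPy sketch of the full AFM forward pass; it is a sketch under the shapes stated above, not the authors' implementation, and the parameters in the usage example are random placeholders.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def afm_predict(x, w0, w, V, W, b, h, p):
    """AFM forward pass: linear part + attention-weighted pair-wise interactions.

    V: (n, k) embeddings; W: (t, k), b: (t,), h: (t,) attention network; p: (k,).
    """
    nz = [i for i in range(len(x)) if x[i] != 0.0]
    pairs = [(i, j) for a, i in enumerate(nz) for j in nz[a + 1:]]
    inter = np.stack([(V[i] * V[j]) * (x[i] * x[j]) for i, j in pairs])  # Eq. (3), (P, k)

    # attention scores a_ij via a one-layer MLP + softmax over all pairs (Eq. (5))
    a = relu(inter @ W.T + b) @ h                        # (P,)
    a = np.exp(a - a.max())
    a /= a.sum()

    # attention-based pooling (Eq. (4)), projected by p and added to the linear part (Eq. (6))
    pooled = (a[:, None] * inter).sum(axis=0)            # (k,)
    return w0 + w @ x + p @ pooled

# toy usage with random placeholder parameters
rng = np.random.default_rng(0)
n, k, t = 5, 4, 3
x = np.array([1.0, 0.0, 2.0, 0.0, 1.0])
print(afm_predict(x, 0.1, rng.normal(size=n), rng.normal(size=(n, k)),
                  rng.normal(size=(t, k)), rng.normal(size=t),
                  rng.normal(size=t), rng.normal(size=k)))
```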

4.3 Overfitting Prevention

  • Here we consider two techniques to prevent overfitting: dropout and L2 regularization.
    • The idea of dropout is to randomly drop some neurons (along with their connections) during training. It is shown to be capable of preventing complex co-adaptations of neurons on training data.
    • As dropout is disabled during testing and the whole network is used for prediction, dropout has another side effect of performing model averaging with smaller neural networks, which may potentially improve the performance [1].
    • We employ dropout on the pair-wise interaction layer to avoid co-adaptations.
    • For the attention network, which is a one-layer MLP, L2 regularization is applied to its weight matrix $\mathbf{W}$ instead.
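
To make the two techniques concrete, here is a minimal NumPy sketch (not the authors' code); `p_drop` and `lam` are hypothetical hyper-parameter names, and element-wise inverted dropout is one common formulation.

```python
import numpy as np

def dropout_interactions(inter, p_drop, rng, training=True):
    """Inverted dropout on the pair-wise interaction layer output (P, k).
    At test time dropout is disabled and the whole layer is used, which gives
    the model-averaging effect mentioned above."""
    if not training or p_drop == 0.0:
        return inter
    mask = (rng.random(inter.shape) >= p_drop).astype(inter.dtype)
    return inter * mask / (1.0 - p_drop)   # rescale so the expected value is unchanged

def l2_penalty(W, lam):
    """L2 regularization term lam * ||W||^2 on the attention network's weight
    matrix W, added to the training loss."""
    return lam * np.sum(W ** 2)
```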

5 Related Work

  • FM is recognized as the most effective linear embedding method for sparse data prediction.

  • We point out that in deep-learning-based methods, feature interactions are implicitly captured by a deep neural network, rather than explicitly modelled as the inner product of two feature embeddings as in FM.

6 Results

6.1 Performance Comparison

The methods are compared in terms of RMSE (results table omitted here).
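
For reference, the evaluation metric is the standard root mean squared error over the test set $\mathcal{T}$ (lower is better):
$$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|} \sum_{(\mathbf{x}, y) \in \mathcal{T}} \left( \hat{y}(\mathbf{x}) - y \right)^2}$$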

7 Conclusions

  • Our AFM enhances FM by learning the importance of feature interactions with an attention network.

  • In future work, we will explore a deep version of AFM by stacking multiple non-linear layers above the attention-based pooling layer and see whether it can further improve the performance.

  • Since AFM has a relatively high complexity, quadratic in the number of non-zero features, we will consider improving its learning efficiency, for example by using learning-to-hash [2, 3] and data-sampling [4] techniques.


References

[1] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
[2] Hanwang Zhang, Fumin Shen, Wei Liu, Xiangnan He, Huanbo Luan, and Tat-Seng Chua. Discrete collaborative filtering. In SIGIR, 2016.
[3] Fumin Shen, Chunhua Shen, Wei Liu, and Heng Tao Shen. Supervised discrete hashing. In CVPR, 2015.
[4] Meng Wang, Weijie Fu, Shijie Hao, Hengchang Liu, and Xindong Wu. Learning on big graph: Label inference and regularization with anchor hierarchy. IEEE TKDE, 2017.
