Attentional Factorization Machines [Paper Notes]

1 Abstract

  • FM can be hindered by its modelling of all feature interactions with the same weight, as not all feature interactions are equally useful and predictive. For example, the interactions with useless features may even introduce noise and adversely degrade the performance.

  • We improve FM by discriminating the importance of different feature interactions. We propose a novel model named Attentional Factorization Machine (AFM), which learns the importance of each feature interaction from data via a neural attention network.

2 Introduction

  • To leverage the interactions between features, one common solution is to explicitly augment a feature vector with products of features (aka cross features), as in polynomial regression (PR).

  • The key problem with PR is that for sparse datasets, where only a few cross features are observed, the parameters for unobserved cross features cannot be estimated (see the sketch at the end of this section).

  • By learning an embedding vector for each feature, FM can estimate the weight for any cross feature.

  • We devise a novel model named AFM, which utilizes the recent advance in neural network modelling: the attention mechanism.

  • It not only leads to better performance, but also provides insight into which feature interactions contribute more to the prediction.
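As a concrete illustration of the cross-feature idea and its sparsity problem, here is a minimal sketch (my own illustration, not code from the paper; `augment_with_cross_features` is a hypothetical helper):

```python
import numpy as np

def augment_with_cross_features(x):
    """Polynomial-regression-style augmentation: append every pairwise product
    x_i * x_j (i < j), each of which gets its own independent weight in PR."""
    n = len(x)
    crosses = [x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate([x, np.asarray(crosses)])

# For sparse one-hot inputs, a cross feature whose two features never co-occur
# in training is always 0, so its weight receives no gradient and cannot be
# estimated; FM sidesteps this by factorizing the cross weight (Eq. (2) below).
x = np.array([1.0, 0.0, 1.0, 0.0])
print(augment_with_cross_features(x))   # 4 original features + 6 cross features
```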

3 Factorization Machines

  • Given a real-valued feature vector $\mathbf{x} \in \mathbb{R}^n$, where $n$ denotes the number of features, FM estimates the target by modelling all interactions between each pair of features:
    $$\hat{y}_{FM}(\mathbf{x}) = \underbrace{w_0 + \sum_{i=1}^{n} w_i x_i}_{\text{linear regression}} + \underbrace{\sum_{i=1}^{n} \sum_{j=i+1}^{n} \hat{w}_{ij}\, x_i x_j}_{\text{pair-wise feature interactions}} \tag{1}$$
    where $w_0$ is the global bias, $w_i$ is the weight of the $i$-th feature, and $\hat{w}_{ij}$ is the weight of the cross feature $x_i x_j$, factorized as
    $$\hat{w}_{ij} = \mathbf{v}_i^T \mathbf{v}_j \tag{2}$$
    where $\mathbf{v}_i \in \mathbb{R}^k$ is the embedding vector of feature $i$ and $k$ is the embedding size. Note that because of the factor $x_i x_j$, only interactions between non-zero features contribute (a minimal sketch of Eqs. (1)-(2) follows at the end of this section).

  • It is worth noting that FM models all feature interactions in the same way:

    • a latent vector $\mathbf{v}_i$ is shared in estimating all feature interactions that the $i$-th feature involves;
    • all estimated feature interactions $\hat{w}_{ij}$ have a uniform weight of 1.
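
As referenced above, here is a minimal NumPy sketch of the FM prediction in Eqs. (1)-(2), written in the naive O(n²) form for clarity; the parameter values in the usage example are random placeholders, not trained weights.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction (Eq. (1)) with the factorized cross weight of Eq. (2).

    x : (n,) feature vector
    w0: scalar global bias
    w : (n,) linear weights
    V : (n, k) embedding matrix; row i is the embedding v_i of feature i
    """
    linear = w0 + w @ x
    n = len(x)
    # pair-wise interactions: w_hat_ij = v_i . v_j, weighted by x_i * x_j
    pairwise = sum(
        (V[i] @ V[j]) * x[i] * x[j]
        for i in range(n) for j in range(i + 1, n)
    )
    return linear + pairwise

# toy usage: only the non-zero entries of x contribute to the pairwise term
rng = np.random.default_rng(0)
n, k = 6, 4
x = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
print(fm_predict(x, 0.1, rng.normal(size=n), rng.normal(size=(n, k))))
```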

4 Attentional Factorization Machines

  • We omit the linear regression part in the figure, which can be trivially incorporated. Below we detail the pair-wise interaction layer and the attention-based pooling layer, which are the main contributions of this paper.
    (Figure: the neural network architecture of the AFM model.)

4.1 Pair-wise Interaction Layer

  • It expands m vectors to m(m − 1)/2 interacted vectors (where m is the number of non-zero features), and each interacted vector is the element-wise product of two distinct vectors to encode their interaction. We can then represent the output of the pair-wise interaction layer as a set of vectors:
    $$f_{PI}(\mathcal{E}) = \left\{ (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j \right\}_{(i,j) \in \mathcal{R}_x} \tag{3}$$
    where $\odot$ denotes the element-wise product of two vectors, $\mathcal{E}$ is the set of embedding vectors of the non-zero features, and $\mathcal{R}_x$ is the set of index pairs $(i, j)$ with $i < j$ over the non-zero features of $\mathbf{x}$.

    In short, the embedding vectors of all non-zero features are crossed pairwise.
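
A minimal sketch of this layer (my own illustration; the function name is hypothetical):

```python
import numpy as np

def pairwise_interaction_layer(V, x):
    """Eq. (3): the set {(v_i ⊙ v_j) x_i x_j} over all pairs of non-zero
    features, returned as an (m(m-1)/2, k) array of interacted vectors."""
    nz = [i for i in range(len(x)) if x[i] != 0.0]      # indices of non-zero features
    interactions = [
        (V[i] * V[j]) * (x[i] * x[j])                   # element-wise product, scaled
        for a, i in enumerate(nz) for j in nz[a + 1:]
    ]
    return np.stack(interactions)

# toy usage: 3 non-zero features out of 5 -> 3 interacted vectors of size k = 4
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 4))
x = np.array([1.0, 0.0, 2.0, 0.0, 1.0])
print(pairwise_interaction_layer(V, x).shape)           # (3, 4)
```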

4.2 Attention-based Pooling Layer

  • The idea of the attention mechanism is to allow different parts to contribute differently when compressing them into a single representation.

  • We propose to employ the attention mechanism on feature interactions by performing a weighted sum on the interacted vectors:
    $$f_{Att}\left(f_{PI}(\mathcal{E})\right) = \sum_{(i,j) \in \mathcal{R}_x} a_{ij}\, (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j \tag{4}$$
    where $a_{ij}$ is the attention score of the feature interaction $\hat{w}_{ij}$, i.e., its learned weight.

  • The attention network is defined as:
    $$\begin{aligned} a'_{ij} &= \mathbf{h}^T \mathrm{ReLU}\left(\mathbf{W}\,(\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j + \mathbf{b}\right) \\ a_{ij} &= \frac{\exp(a'_{ij})}{\sum_{(i,j) \in \mathcal{R}_x} \exp(a'_{ij})} \end{aligned} \tag{5}$$
    where $\mathbf{W} \in \mathbb{R}^{t \times k}$, $\mathbf{b} \in \mathbb{R}^t$, and $\mathbf{h} \in \mathbb{R}^t$ are model parameters, and $t$ (the attention factor) is the hidden layer size of the attention network.

  • To summarize, we give the overall formulation of the AFM model as:
    $$\hat{y}_{AFM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \mathbf{p}^T \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_{ij}\, (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j \tag{6}$$
    where $\mathbf{p} \in \mathbb{R}^k$ projects the pooled interaction vector to the prediction score. In essence, AFM combines a linear model with an FM whose pair-wise interactions are re-weighted by attention.
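
Putting Eqs. (3)-(6) together, below is a minimal NumPy sketch of the full AFM forward pass; it is a sketch under the shapes stated above, not the authors' implementation, and the parameters in the usage example are random placeholders.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def afm_predict(x, w0, w, V, W, b, h, p):
    """AFM forward pass: linear part + attention-weighted pair-wise interactions.

    V: (n, k) embeddings; W: (t, k), b: (t,), h: (t,) attention network; p: (k,).
    """
    nz = [i for i in range(len(x)) if x[i] != 0.0]
    pairs = [(i, j) for a, i in enumerate(nz) for j in nz[a + 1:]]
    inter = np.stack([(V[i] * V[j]) * (x[i] * x[j]) for i, j in pairs])  # Eq. (3), (P, k)

    # attention scores a_ij via a one-layer MLP + softmax over all pairs (Eq. (5))
    a = relu(inter @ W.T + b) @ h                        # (P,)
    a = np.exp(a - a.max())
    a /= a.sum()

    # attention-based pooling (Eq. (4)), projected by p and added to the linear part (Eq. (6))
    pooled = (a[:, None] * inter).sum(axis=0)            # (k,)
    return w0 + w @ x + p @ pooled

# toy usage with random placeholder parameters
rng = np.random.default_rng(0)
n, k, t = 5, 4, 3
x = np.array([1.0, 0.0, 2.0, 0.0, 1.0])
print(afm_predict(x, 0.1, rng.normal(size=n), rng.normal(size=(n, k)),
                  rng.normal(size=(t, k)), rng.normal(size=t),
                  rng.normal(size=t), rng.normal(size=k)))
```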

4.3 Overfitting Prevention

  • Here we consider two techniques to prevent overfitting: dropout and L2 regularization.
    • The idea of dropout is to randomly drop some neurons (along with their connections) during training. It is shown to be capable of preventing complex co-adaptations of neurons on training data.
    • As dropout is disabled during testing and the whole network is used for prediction, dropout has another side effect of performing model averaging with smaller neural networks, which may potentially improve the performance [1].
    • We employ dropout on the pair-wise interaction layer to avoid co-adaptations.
    • For the attention network, which is a one-layer MLP, L2 regularization is applied to its weight matrix $\mathbf{W}$ instead.
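
To make the two techniques concrete, here is a minimal NumPy sketch (not the authors' code); `p_drop` and `lam` are hypothetical hyper-parameter names, and element-wise inverted dropout is one common formulation.

```python
import numpy as np

def dropout_interactions(inter, p_drop, rng, training=True):
    """Inverted dropout on the pair-wise interaction layer output (P, k).
    At test time dropout is disabled and the whole layer is used, which gives
    the model-averaging effect mentioned above."""
    if not training or p_drop == 0.0:
        return inter
    mask = (rng.random(inter.shape) >= p_drop).astype(inter.dtype)
    return inter * mask / (1.0 - p_drop)   # rescale so the expected value is unchanged

def l2_penalty(W, lam):
    """L2 regularization term lam * ||W||^2 on the attention network's weight
    matrix W, added to the training loss."""
    return lam * np.sum(W ** 2)
```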

5 Related Work

  • FM is recognized as the most effective linear embedding method for sparse data prediction.

  • We point out that in deep-learning-based methods, feature interactions are implicitly captured by a deep neural network, rather than explicitly modelled as the inner product of two feature embeddings as in FM.

6 Results

6.1 Performance Comparison

The methods are compared in terms of RMSE (results table omitted here).
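
For reference, the evaluation metric is the standard root mean squared error over the test set $\mathcal{T}$ (lower is better):
$$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|} \sum_{(\mathbf{x}, y) \in \mathcal{T}} \left( \hat{y}(\mathbf{x}) - y \right)^2}$$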

7 Conclusions

  • Our AFM enhances FM by learning the importance of feature interactions with an attention network.

  • In future work, we will explore a deep version of AFM by stacking multiple non-linear layers above the attention-based pooling layer and see whether it can further improve the performance.

  • Since AFM has a relatively high complexity, quadratic in the number of non-zero features, we will consider improving its learning efficiency, for example by using learning-to-hash [2, 3] and data-sampling [4] techniques.


References

[1] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
[2] Hanwang Zhang, Fumin Shen, Wei Liu, Xiangnan He, Huanbo Luan, and Tat-Seng Chua. Discrete collaborative filtering. In SIGIR, 2016.
[3] Fumin Shen, Chunhua Shen, Wei Liu, and Heng Tao Shen. Supervised discrete hashing. In CVPR, 2015.
[4] Meng Wang, Weijie Fu, Shijie Hao, Hengchang Liu, and Xindong Wu. Learning on big graph: Label inference and regularization with anchor hierarchy. IEEE TKDE, 2017.
