Deep & Cross Network for Ad Click Predictions: Reading Notes

Latest recommended article published on 2024-07-21 17:17:02

一杯敬朝阳一杯敬月光

Article link: https://blog.csdn.net/qq_xuanshuang/article/details/113685549

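Abstract

A DNN does not necessarily learn every type of cross feature efficiently.
DCN introduces a novel cross network that is more efficient at learning feature interactions of bounded degree.
DCN applies feature crossing explicitly at each layer while adding almost no extra model complexity.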

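1. Introduction

Identifying frequently predictive features while also exploring unseen or rare cross features is the key to making good predictions.
The degree of the crosses is determined by the number of layers in the cross network (The cross network consists of multiple layers, where the highest-degree of interactions are provably determined by layer depth.)
Drawbacks of a plain DNN (compared to our cross network it requires nearly an order of magnitude more parameters, is unable to form cross features explicitly, and may fail to efficiently learn some types of feature interactions):
- It cannot form cross features explicitly.
- It needs many more parameters to reach good feature crossing, possibly because most of its layers are fully connected.
- It cannot efficiently express some types of feature interactions.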

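1.1 Related Work

As datasets grow in size and feature dimensionality, embeddings and neural networks are commonly used to avoid extensive manual feature engineering.

FM (Factorization Machines): projects sparse features onto low-dimensional dense vectors and learns feature interactions from the inner products of those vectors.
FFM (field-aware factorization machines): each feature learns several vectors.
The shallow structures of FMs and FFMs limit their representative power.
Thanks to embedding vectors and nonlinear activation functions, deep neural networks can learn meaningful high-degree feature interactions (Deep neural networks (DNN) are able to learn non-trivial high-degree feature interactions due to embedding vectors and nonlinear activation functions).
Residual networks (Residual Network) make it possible to train much deeper neural networks.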
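Deep Crossing: extends residual networks by stacking all types of features (Deep Crossing [15] extends residual networks and achieves automatic feature learning by stacking all types of inputs.). As a brief explanation, Deep Crossing consists of four layers:
- Embedding layer
- Stacking layer: concatenates all features into one vector.
- Multiple Residual Units layer: the core of this layer is a multilayer perceptron. Instead of a standard network built from plain perceptron units, Deep Crossing implements the MLP with multiple residual units. The residual units cross and combine the dimensions of the feature vector thoroughly, so the model can capture more nonlinear and combinatorial feature information.
- Scoring layer: exists to fit the optimization objective. Binary classification problems such as CTR typically use a logistic regression model, and multi-class problems typically use softmax.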
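In Kaggle competitions, many winning solutions rely on hand-crafted low-degree features, which are explicit and efficient. The feature crossing learned by a DNN, by contrast, is implicit and highly nonlinear. This motivates a model that learns bounded-degree feature interactions more explicitly and more efficiently than a DNN (learn bounded-degree feature interactions more efficiently and explicitly than a universal DNN).
The wide & deep model follows this spirit: it feeds cross features into a linear model and trains the linear part jointly with the deep part. Its success, however, depends partly on choosing the right cross features, an exponential problem for which no efficient method is known yet.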

1.2 Main Contributions

DCN crosses features automatically. It learns bounded-degree and highly nonlinear feature interactions without manual feature engineering or exhaustive search, at low computational cost. (In this paper, we propose the Deep & Cross Network (DCN) model that enables Web-scale automatic feature learning with both sparse and dense inputs. DCN efficiently captures effective feature interactions of bounded degrees, learns highly nonlinear interactions, requires no manual feature engineering or exhaustive searching, and has low computational cost.)

A novel cross network applies feature crossing explicitly at each layer and efficiently learns cross features of bounded degree, with no manual feature engineering or exhaustive search. (We propose a novel cross network that explicitly applies feature crossing at each layer, efficiently learns predictive cross features of bounded degrees, and requires no manual feature engineering or exhaustive searching.)
The cross network is simple yet effective, and the highest degree of feature crossing grows with layer depth. (The cross network is simple yet effective. By design, the highest polynomial degree increases at each layer and is determined by layer depth. The network consists of all the cross terms of degree up to the highest, with their coefficients all different.)
The cross network is easy to implement and memory efficient.
Experiments show that DCN reaches a lower logloss than a DNN while using nearly an order of magnitude fewer parameters. (Our experimental results have demonstrated that with a cross network, DCN has lower logloss than a DNN with nearly an order of magnitude fewer number of parameters)

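2. Deep & Cross Network (DCN)

The DCN model consists of the following parts:

An embedding and stacking layer that stacks the dense features with the embedding vectors of the sparse features (starts with an embedding and stacking layer)
A cross network and a deep network running in parallel (followed by a cross network and a deep network in parallel)
A final combination layer that concatenates the outputs of the two networks for scoring (These in turn are followed by a final combination layer which combines the outputs from the two networks)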

2.1 embedding and stacking layer

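The embedding vectors of the sparse features are stacked (concatenated) with the dense features into a single vector $\mathbf{x}_0$, which is the input to both networks.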
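
The stacking step can be sketched in a few lines of NumPy. This is only an illustration under assumed settings: the vocabulary sizes, embedding sizes and the helper name `embed_and_stack` are hypothetical, and the embedding tables are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two categorical fields plus three dense features.
vocab_sizes = [1000, 50]   # distinct values per sparse field (assumed)
embed_dims = [8, 4]        # embedding size per sparse field (assumed)
embed_tables = [rng.normal(scale=0.01, size=(v, e))
                for v, e in zip(vocab_sizes, embed_dims)]

def embed_and_stack(sparse_ids, dense_feats):
    """Look up one embedding per sparse field and concatenate with the dense features."""
    embedded = [table[i] for table, i in zip(embed_tables, sparse_ids)]
    return np.concatenate(embedded + [np.asarray(dense_feats, dtype=float)])

x0 = embed_and_stack(sparse_ids=[42, 7], dense_feats=[0.3, -1.2, 0.5])
print(x0.shape)  # (15,) = 8 + 4 + 3
```

The resulting `x0` plays the role of $\mathbf{x}_0$, the shared input of the cross network and the deep network.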

2.2 cross network

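The cross network is built around explicit feature crossing done efficiently. It is composed of cross layers, each of which follows the formula: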

$\mathbf{x}_{l+1} = \mathbf{x}_0 \mathbf{x}_l^T \mathbf{w}_l + \mathbf{b}_l + \mathbf{x}_l = f(\mathbf{x}_l, \mathbf{w}_l, \mathbf{b}_l) + \mathbf{x}_l \ \ \ \ \ \ \ (3)$

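Here the column vectors $\mathbf{x}_l, \mathbf{x}_{l+1} \in \mathbb{R}^d$ are the outputs of the $l$-th and $(l+1)$-th cross layers, and $\mathbf{w}_l, \mathbf{b}_l \in \mathbb{R}^d$ are the weight and bias of the $l$-th layer. Each layer adds its input back after the feature crossing $f$, so the mapping function $f: \mathbb{R}^d \mapsto \mathbb{R}^d$ fits the residual $\mathbf{x}_{l + 1} - \mathbf{x}_l$.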
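
Eq. (3) translates almost directly into code. The following is a minimal sketch with random, untrained weights; the function names `cross_layer` and `cross_network` are chosen for this note and are not from the paper.

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One cross layer, Eq. (3): x_{l+1} = x0 * (xl^T w) + b + xl."""
    # xl @ w is a scalar, so the d x d matrix x0 xl^T is never formed.
    return x0 * (xl @ w) + b + xl

def cross_network(x0, weights, biases):
    """Stack cross layers; every layer crosses the current state with the original x0."""
    xl = x0
    for w, b in zip(weights, biases):
        xl = cross_layer(x0, xl, w, b)
    return xl

rng = np.random.default_rng(0)
d, num_layers = 5, 3                     # hypothetical sizes
x0 = rng.normal(size=d)
weights = [rng.normal(size=d) for _ in range(num_layers)]
biases = [np.zeros(d) for _ in range(num_layers)]

out = cross_network(x0, weights, biases)
print(out.shape)  # (5,)
```

The output keeps the input dimension $d$ at every layer, which is what allows the residual connection $+\,\mathbf{x}_l$.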

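High-degree feature crossing: this special structure makes the degree of the feature interactions grow with the depth of the cross network. In terms of the input $\mathbf{x}_0$, the highest polynomial degree of an $l$-layer cross network is $l+1$. In fact, the cross network contains all the cross terms $x_1^{\alpha_1}x_2^{\alpha_2}\cdots x_d^{\alpha_d}$ of degree 1 to $l+1$.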
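
As a small worked case of the degree claim (an illustration added to these notes, not taken from the paper), take $d = 2$, a single cross layer and $\mathbf{b}_0 = \mathbf{0}$. Writing $\mathbf{x}_0 = [x_1, x_2]^T$ and $\mathbf{w}_0 = [w_0^{(1)}, w_0^{(2)}]^T$, Eq. (3) gives

$\mathbf{x}_1 = \mathbf{x}_0 (\mathbf{x}_0^T \mathbf{w}_0) + \mathbf{x}_0 = \begin{bmatrix} w_0^{(1)} x_1^2 + w_0^{(2)} x_1 x_2 + x_1 \\ w_0^{(1)} x_1 x_2 + w_0^{(2)} x_2^2 + x_2 \end{bmatrix}$

Here the bold $\mathbf{x}_1$ is the layer output, while $x_1, x_2$ are the components of the input. After one layer the output holds every monomial of degree 1 and 2 in $x_1, x_2$, so the highest degree is $l + 1 = 2$.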

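Complexity analysis: the cross network has $d \times L_c \times 2$ parameters, where $d$ is the input dimension and $L_c$ is the number of cross layers. Its time and space complexity grow linearly with the input dimension, so the complexity it adds is negligible compared with the conventional DNN that runs in parallel. Thanks to the rank-one property of $\mathbf{x}_0 \mathbf{x}_l^T$, all cross terms can be generated without storing the entire matrix. (This efficiency benefits from the rank-one property of $\mathbf{x}_0 \mathbf{x}_l^T$ , which enables us to generate all cross terms without computing or storing the entire matrix.)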
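
The rank-one argument can be checked numerically. The sketch below, with an arbitrary dimension and random vectors, compares the naive evaluation that materialises $\mathbf{x}_0 \mathbf{x}_l^T$ with the reordered evaluation that computes the scalar $\mathbf{x}_l^T \mathbf{w}$ first.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 1000                                  # hypothetical input dimension
x0, xl, w = rng.normal(size=(3, d))       # three random vectors of length d

# Naive: build the d x d rank-one matrix, then multiply by w. O(d^2) time and memory.
naive = np.outer(x0, xl) @ w

# Efficient: evaluate xl^T w first (a scalar), then scale x0. O(d) time and memory.
efficient = x0 * (xl @ w)

print(np.allclose(naive, efficient))  # True
```

Both orders of evaluation give the same vector because matrix multiplication is associative: $(\mathbf{x}_0 \mathbf{x}_l^T)\mathbf{w} = \mathbf{x}_0 (\mathbf{x}_l^T \mathbf{w})$.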

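The small number of parameters also limits the capacity of the cross network. To capture highly nonlinear interactions, DCN adds a deep network in parallel.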

2.3 Deep Network

The deep network is a fully connected feed-forward neural network. Each layer follows the formula:

$\mathbf{h}_{l+1} = f(W_l \mathbf{h}_l + \mathbf{b}_l) \ \ \ \ \ \ (4)$

Here $\mathbf{h}_l \in \mathbb{R}^{n_l},\mathbf{h}_{l+1} \in \mathbb{R}^{n_{l+1}}$ are the $l$-th and $(l+1)$-th hidden layers, $W_l \in \mathbb{R}^{n_{l+1} \times n_l},\mathbf{b}_l \in \mathbb{R}^{n_{l+1}}$ are the parameters of the $l$-th layer, and $f(\cdot)$ is the ReLU activation function.
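
A forward pass through Eq. (4) might look as follows. The layer sizes and the random initialisation are assumptions made for the example only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def deep_network(h0, weights, biases):
    """Apply Eq. (4) layer by layer: h_{l+1} = ReLU(W_l h_l + b_l)."""
    h = h0
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h

rng = np.random.default_rng(0)
d, m, L_d = 15, 8, 2             # hypothetical input size, layer width and depth
sizes = [d] + [m] * L_d
weights = [rng.normal(scale=0.1, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

h_out = deep_network(rng.normal(size=d), weights, biases)
print(h_out.shape)  # (8,)
```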

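Complexity analysis: for simplicity, assume all layers of the deep network have the same size. Let $L_d$ be the depth of the deep network and $m$ the number of hidden units per layer. The number of parameters in the deep network is then $d \times m + m + (m^2 + m) \times (L_d - 1)$.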
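
To get a feel for the two parameter counts, the formulas for the cross network and the deep network can be evaluated side by side. The sizes below are made up for illustration and are not the settings used in the paper.

```python
def cross_params(d, L_c):
    """Cross network: one weight vector and one bias vector of size d per layer."""
    return d * L_c * 2

def deep_params(d, m, L_d):
    """Deep network with L_d layers of m units: first layer, then L_d - 1 square layers."""
    return d * m + m + (m * m + m) * (L_d - 1)

d, m = 1000, 1024          # hypothetical input dimension and hidden width
L_c, L_d = 6, 2            # hypothetical depths

print(cross_params(d, L_c))    # 12000
print(deep_params(d, m, L_d))  # 2074624
```

With these assumed sizes the cross network adds well under one percent to the parameter count of the deep network.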

2.4 Combination Layer

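The combination layer concatenates the outputs of the two networks and feeds the concatenated vector into a standard logits layer. For binary classification the formula is: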

$p = \sigma ([\mathbf{x}_{L_1}^T, \mathbf{h}_{L_2}^T] \mathbf{w}_{logits}) \ \ \ \ \ \ \ \ (5)$

Here $\mathbf{x}_{L_1} \in \mathbb{R}^d, \mathbf{h}_{L_2} \in \mathbb{R}^m$ are the outputs of the cross network and the deep network, and $\mathbf{w}_{logits} \in \mathbb{R}^{(d+m)}$ is the weight vector of the combination layer.

The loss function is the logloss with an L2 regularization term.
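
A sketch of the combination layer and the loss follows. The exact form of the regularizer (a single coefficient `lam` times the sum of squared weights) is an assumption of this example, as are the function names and the clipping constant.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def combine(x_cross, h_deep, w_logits):
    """Eq. (5): concatenate both outputs and apply a logistic layer."""
    return sigmoid(np.concatenate([x_cross, h_deep]) @ w_logits)

def regularized_logloss(p, y, params, lam):
    """Mean logloss plus lam * sum of squared parameters (assumed form of the L2 term)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)      # avoid log(0)
    logloss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    l2 = lam * sum(np.sum(w ** 2) for w in params)
    return logloss + l2

rng = np.random.default_rng(0)
x_cross, h_deep = rng.normal(size=5), rng.normal(size=3)   # d = 5, m = 3 (hypothetical)
w_logits = rng.normal(size=8)                              # d + m weights

p = combine(x_cross, h_deep, w_logits)
loss = regularized_logloss(np.array([p]), np.array([1.0]), [w_logits], lam=1e-4)
print(p, loss)
```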

The cross and deep networks are trained jointly, which lets them influence each other. (We jointly train both networks, as this allows each individual network to be aware of the others during the training.)

3 Cross Network Analysis

The paper analyzes the cross network from three perspectives: polynomial approximation, generalization to FMs, and efficient projection.

Notation: let $w_j^{(i)}$ denote the $i$-th element of the vector $\mathbf{w}_j$. For $\mathbf{x} = [x_1, \cdots, x_d] \in \mathbb{R}^d$ and the corresponding exponents $\mathbf{\alpha} = [\alpha_1, \cdots, \alpha_d] \in \mathbb{N}^d$, define $|\alpha| = \sum_{i=1}^d \alpha_i$. The degree of a cross term (monomial) $x_1^{\alpha_1}x_2^{\alpha_2}\cdots x_d^{\alpha_d}$ is $|\alpha|$, and the degree of a polynomial is the highest degree of its terms.

3.1 polynomial approximation

By the Weierstrass approximation theorem [13], any function under certain smoothness assumption can be approximated by a polynomial to an arbitrary accuracy. In particular, the cross network approximates the polynomial class of the same degree in a way that is efficient, expressive and generalizes better to real-world datasets.

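Consider how the cross network approximates polynomials of the same degree. That polynomial class has $O(d^n)$ coefficients, yet the cross network needs only $O(d)$ parameters to contain every cross term that occurs in a polynomial of the same degree, with each term's coefficient distinct from the others. (We show that, with only $O(d)$ parameters, the cross network contains all the cross terms occurring in the polynomial of the same degree, with each term’s coefficients distinct from each other.)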

$P_n(\mathbf{x})=\{ \sum_{\alpha} w_{\alpha} x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_d^{\alpha_d} \ | \ 0 \leqslant |\alpha| \leqslant n, \alpha \in \mathbb{N}^d \} \ \ \ \ \ (7)$
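
The gap between the two counts is easy to tabulate. The number of monomials in $d$ variables with total degree at most $n$ is the binomial coefficient $\binom{d+n}{n}$, a standard combinatorial fact; the cross network count assumes $n - 1$ layers, since $l$ layers reach degree $l + 1$. The values of $d$ below are arbitrary.

```python
from math import comb

def num_monomials(d, n):
    """Number of monomials in d variables with total degree 0..n, i.e. coefficients in P_n."""
    return comb(d + n, n)

def cross_params(d, n):
    """Parameters of a cross network that reaches degree n, i.e. n - 1 layers of 2d each."""
    return 2 * d * (n - 1)

for d in (10, 100, 1000):
    print(d, num_monomials(d, 3), cross_params(d, 3))
```

For fixed $n$ the first count grows like $d^n$, while the second grows linearly in $d$.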

3.2 generalization of FMs

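Like the FM model, the cross network shares parameters, and it extends the idea to a deeper structure. In an FM, $\mathbf{v}_i$ is the weight vector of feature $x_i$, and the weight of the cross term $x_ix_j$ is computed as $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$. In DCN, $x_i$ is associated with the scalars $\{ w_k^{(i)} \}_{k=1}^l$, and the weight of $x_ix_j$ comes from $\{ w_k^{(i)} \}_{k=1}^l$ and $\{ w_k^{(j)} \}_{k=1}^l$. In both models each feature learns some parameters independently of the other features, and the weight of a cross term is a certain combination of the corresponding parameters. (Both models have each feature learned some parameters independent from other features, and the weight of a cross term is a certain combination of corresponding parameters.)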
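
The FM side of this comparison can be sketched as follows. The factor matrix `V` is random rather than learned, and the sizes and function names are hypothetical; the point is only that every pair weight is derived from per-feature vectors.

```python
import numpy as np

def fm_pair_weight(V, i, j):
    """Weight of the cross term x_i * x_j in an FM: the inner product <v_i, v_j>."""
    return V[i] @ V[j]

def fm_second_order(x, V):
    """Sum of <v_i, v_j> * x_i * x_j over all pairs i < j (naive double loop)."""
    d = len(x)
    return sum(fm_pair_weight(V, i, j) * x[i] * x[j]
               for i in range(d) for j in range(i + 1, d))

rng = np.random.default_rng(3)
d, k = 6, 4                                     # hypothetical feature count and factor size
V = rng.normal(size=(d, k))                     # one k-dimensional vector v_i per feature
x = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])    # sparse binary input

print(fm_second_order(x, V))
# Features 1 and 3 are inactive in x, yet the weight of their cross term is
# still defined through v_1 and v_3, each of which is shared with other pairs.
print(fm_pair_weight(V, 1, 3))
```

The model stores $d \times k$ numbers instead of one free weight for each of the $d(d-1)/2$ pairs.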

Parameter sharing:

makes the model more efficient
enables the model to generalize to unseen feature interactions
makes the model more robust to noise

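For example, consider sparse features. If two binary features $x_i$ and $x_j$ rarely or never co-occur in the training set, that is, $x_i \neq 0 \wedge x_j \neq 0$ rarely or never holds, then a weight learned directly for $x_ix_j$ carries no meaningful information for prediction.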

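The FM is limited by its shallow structure and can only represent second-degree feature interactions. DCN, in contrast, can construct all cross terms whose degree is bounded by the network depth. (DCN, in contrast, is able to construct all the cross terms $x_1^{\alpha_1}x_2^{\alpha_2} \cdots x_d^{\alpha_d}$ with degree $|\alpha|$ bounded by some constant determined by layer depth) The cross network therefore extends parameter sharing from a single layer to multiple layers and high-degree cross terms. Unlike higher-order FMs, the number of parameters in a cross network grows only linearly with the input dimension. (Therefore, the cross network extends the idea of parameter sharing from a single layer to multiple layers and high-degree cross-terms. Note that different from the higher-order FMs, the number of parameters in a cross network only grows linearly with the input dimension.)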

3.3 efficient projection

Each cross layer projects all the pairwise interactions between $\mathbf{x}_0$ and $\mathbf{x}_l$, in an efficient manner, back to the input's dimension.

Take $\mathbf{\widetilde{x}} \in \mathbb{R}^d$ as the input to a cross layer. The layer first implicitly constructs the $d^2$ pairwise interactions $x_i \widetilde{x}_j$ and then implicitly projects them back to dimension $d$ in a memory-efficient way. A direct approach, however, comes with a cubic cost. (Consider $\mathbf{\widetilde{x}} \in \mathbb{R}^d$ as the input to a cross layer. The cross layer first implicitly constructs $d^2$ pairwise interactions $x_i \widetilde{x}_j$ and then implicitly projects them back to dimension d in a memory-efficient way. A direct approach, however, comes with a cubic cost.) The paper gives a formulation that reduces the cost to linear in $d$. Consider $\mathbf{x} _{p} = \mathbf{x} _{0} \widetilde{\mathbf{x}}^T \mathbf{w}$, which is equivalent to

$\mathbf{x}_p^T = [x_1\widetilde{x}_1 \cdots x_1\widetilde{x}_d \cdots x_d\widetilde{x}_1 \cdots x_d\widetilde{x}_d ] \begin{bmatrix} | & & & \\ \mathbf{w}& 0& \cdots & 0\\ | & & & \\ & |& & \\ 0 & \mathbf{w} & \cdots & 0 \\ & |& & \\ \vdots& \vdots& \vdots & \vdots \\ & & & | \\ 0& 0 &\cdots & \mathbf{w} \\ & & & | \end{bmatrix} \ \ \ \ \ (8)$
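
Eq. (8) can be verified numerically for a small $d$. In the sketch below the block-diagonal matrix is built with `np.kron`, which is a choice made for this example; the vectors are random and `x` stands for $\mathbf{x}_0$.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4                                          # small hypothetical dimension
x, x_tilde, w = rng.normal(size=(3, d))

# Row vector of all d^2 pairwise products, ordered
# x_1*x~_1, ..., x_1*x~_d, ..., x_d*x~_1, ..., x_d*x~_d as in Eq. (8).
pairwise = np.outer(x, x_tilde).reshape(-1)    # shape (d*d,)

# Block-diagonal projection matrix of Eq. (8): d copies of the column w, shape (d*d, d).
projection = np.kron(np.eye(d), w.reshape(-1, 1))

via_projection = pairwise @ projection         # explicit d^2 -> d projection
via_cross_layer = x * (x_tilde @ w)            # what the cross layer actually computes

print(np.allclose(via_projection, via_cross_layer))  # True
```

The explicit route touches all $d^2$ products and a $d^2 \times d$ matrix, while the cross layer reaches the same vector with $O(d)$ work.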