Recommender Systems (10): The DeepFM Model (A Factorization-Machine based Neural Network)
Recommender systems series posts:
- Recommender Systems (1): An Overview of Recommender Systems
- Recommender Systems (2): The GBDT+LR Model
- Recommender Systems (3): Factorization Machines (FM)
- Recommender Systems (4): Field-aware Factorization Machines (FFM)
- Recommender Systems (5): Wide & Deep
- Recommender Systems (6): Deep & Cross Network (DCN)
- Recommender Systems (7): The xDeepFM Model
- Recommender Systems (8): The FNN Model (FM+MLP=FNN)
- Recommender Systems (9): The PNN Model (Product-based Neural Networks)
DeepFM is a paper from Harbin Institute of Technology and Huawei published at IJCAI 2017. Inspired by Google's wide&deep model, it is also a two-part (hybrid) model structure; the difference is that the wide part uses an FM instead of an LR. I therefore strongly recommend reading my earlier post Recommender Systems (5): Wide & Deep before this one. Assuming you already know wide&deep well, DeepFM's improvements and advantages over it are:
- In the wide part, FM replaces the LR used in wide&deep. Because FM automatically learns second-order feature crosses (usually limited to second order for time-complexity reasons), no manual feature engineering is needed. In Wide&Deep, the LR part still requires hand-crafted crosses, e.g. crossing [apps the user has installed] with [apps impressed to the user]. Moreover, manual crossing brings back the problem discussed in the FM post: a cross weight can only be learned if the two features actually co-occur in the training data.
- In DeepFM, the FM and DNN parts share the same underlying embedding vectors and are trained jointly. This matches the common multi-task, multi-tower shared-embedding setup in today's recommendation/advertising systems, and end-to-end training also yields more accurate embeddings.
If you are already familiar with wide&deep, the description above basically gives you DeepFM's overall network structure. The rest of this post covers DeepFM in the following parts:
- DeepFM model structure in detail
- DeepFM code implementation
- Summary
1. DeepFM model structure in detail
Let's look at the DeepFM model structure (the figure comes from Wang Zhe's book Deep Learning Recommender System; the figure in the original paper is not very clear, so I did not take it from the paper directly).
The overall structure is fairly simple. From bottom to top, the layers are:
- Raw input layer: one-hot encoded sparse inputs
- Embedding layer: the base shared by FM and DNN
- FM and DNN
- Output layer
1.1 FM
Let's focus on the FM part. First, recall the FM formula:
$$\hat{y}(x) = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n\sum_{j=i+1}^n <v_i, v_j> x_i x_j \tag{1}$$
The formula above consists of two parts:
1.1.1 First-order part
$$w_0 + \sum_{i=1}^n w_i x_i \tag{2}$$
This part is simply an LR; nothing special to say.
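To make equation (2) concrete, here is a tiny NumPy sketch of the first-order term; w0, w and x are made-up values purely for illustration.

import numpy as np

w0 = 0.1
w = np.array([0.3, -0.2, 0.5])  # per-feature weights w_i
x = np.array([1.0, 0.0, 2.0])   # feature values x_i

y_first_order = w0 + np.dot(w, x)
print(y_first_order)  # 0.1 + 0.3*1.0 - 0.2*0.0 + 0.5*2.0 = 1.4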
1.1.2 Second-order part
$$\sum_{i=1}^n\sum_{j=i+1}^n <v_i, v_j> x_i x_j \tag{3}$$
As explained in the post Recommender Systems (3): Factorization Machines (FM), evaluating equation (3) directly (computing the inner product for every feature pair) has time complexity $O(N^2)$. The FM paper gives a derivation that brings this down to $O(KN)$:
$$
\begin{aligned}
& \sum_{i=1}^{n}\sum_{j=i+1}^n <v_i, v_j> x_i x_j \\
&= \frac{1}{2}\left[\sum_{i=1}^{n}\sum_{j=1}^n <v_i, v_j> x_i x_j - \sum_{i=1}^{n} <v_i, v_i> x_i x_i\right] \\
&= \frac{1}{2}\left(\sum_{i=1}^n\sum_{j=1}^n\sum_{f=1}^k v_{i,f} \cdot v_{j,f}\, x_i x_j - \sum_{i=1}^n\sum_{f=1}^k v_{i,f} \cdot v_{i,f}\, x_i x_i\right) \\
&= \frac{1}{2}\sum_{f=1}^k\left(\left(\sum_{i=1}^n v_{i,f} x_i\right)\left(\sum_{j=1}^n v_{j,f} x_j\right) - \sum_{i=1}^n v_{i,f}^2 x_i^2\right) \\
&= \frac{1}{2}\sum_{f=1}^k\left(\left(\sum_{i=1}^n v_{i,f} x_i\right)^2 - \sum_{i=1}^n v_{i,f}^2 x_i^2\right)
\end{aligned}
\tag{4}
$$
So in practice, implementations compute the second-order term via equation (4).
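Here is a minimal NumPy sketch that checks equation (4) against the naive pairwise computation; n, k, x and V are made-up values purely for illustration.

import numpy as np

n, k = 10, 4                      # n features, k latent factors
rng = np.random.default_rng(0)
x = rng.random(n)                 # feature values x_i
V = rng.normal(size=(n, k))       # latent vectors v_i (rows of V)

# naive version: sum over all pairs i < j of <v_i, v_j> * x_i * x_j, O(k*n^2)
naive = sum((V[i] @ V[j]) * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# equation (4): 0.5 * sum_f ((sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2), O(k*n)
fast = 0.5 * np.sum(np.square(V.T @ x) - (V.T ** 2) @ (x ** 2))

print(np.allclose(naive, fast))   # True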
1.2 DNN
Nothing much to say here: it is just a multi-layer fully connected network.
1.3 Final output
The FM output and the DNN output are added together and fed into a sigmoid, as shown below:
$$\hat{y} = sigmoid(y_{FM} + y_{DNN})$$
2. DeepFM code implementation
This part is the focus of this post. I will walk through Paddle's official implementation directly; working through the code details helps us understand the DeepFM model more deeply.
2.1 Dataset
The dataset used here is Criteo, a dataset for ad CTR prediction (see the Criteo description for details). In terms of features, it has 26 categorical (sparse) features and 13 continuous (dense) features.
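As a concrete illustration of the input format assumed by the code below (shapes only; the values are random placeholders, not the official data pipeline), a training batch consists of 26 sparse ID tensors of shape [batch_size, 1] plus one dense tensor of shape [batch_size, 13]:

import paddle

batch_size = 2
# 26 categorical features, each id hashed into [0, 1000001)
sparse_inputs = [paddle.randint(0, 1000001, shape=[batch_size, 1])
                 for _ in range(26)]
# 13 continuous features
dense_inputs = paddle.rand([batch_size, 13])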
2.2 FM implementation
I have added detailed comments to the code (mainly tensor-shape annotations); just read through the code.
import math

import paddle
import paddle.nn as nn


class FM(nn.Layer):
    def __init__(self, sparse_feature_number, sparse_feature_dim,
                 dense_feature_dim, sparse_num_field):
        super(FM, self).__init__()
        self.sparse_feature_number = sparse_feature_number  # 1000001
        self.sparse_feature_dim = sparse_feature_dim  # 9
        self.dense_feature_dim = dense_feature_dim  # 13
        self.dense_emb_dim = self.sparse_feature_dim  # 9
        self.sparse_num_field = sparse_num_field  # 26
        self.init_value_ = 0.1
        use_sparse = True

        # sparse coding
        # Embedding(1000001, 1, padding_idx=0, sparse=True)
        self.embedding_one = paddle.nn.Embedding(
            sparse_feature_number,
            1,
            padding_idx=0,
            sparse=use_sparse,
            weight_attr=paddle.ParamAttr(
                initializer=paddle.nn.initializer.TruncatedNormal(
                    mean=0.0,
                    std=self.init_value_ /
                    math.sqrt(float(self.sparse_feature_dim)))))
        # Embedding(1000001, 9, padding_idx=0, sparse=True)
        self.embedding = paddle.nn.Embedding(
            self.sparse_feature_number,
            self.sparse_feature_dim,
            sparse=use_sparse,
            padding_idx=0,
            weight_attr=paddle.ParamAttr(
                initializer=paddle.nn.initializer.TruncatedNormal(
                    mean=0.0,
                    std=self.init_value_ /
                    math.sqrt(float(self.sparse_feature_dim)))))

        # dense coding
        """
        Tensor(shape=[13], dtype=float32, place=CPUPlace, stop_gradient=False,
            [-0.00486396, 0.02755001, -0.01340683, 0.05218775, 0.00938804, 0.01068084, 0.00679830,
             0.04791596, -0.04357519, 0.06603041, -0.02062148, -0.02801327, -0.04119579]))
        """
        self.dense_w_one = paddle.create_parameter(
            shape=[self.dense_feature_dim],
            dtype='float32',
            default_initializer=paddle.nn.initializer.TruncatedNormal(
                mean=0.0,
                std=self.init_value_ /
                math.sqrt(float(self.sparse_feature_dim))))
        # Tensor(shape=[1, 13, 9])
        self.dense_w = paddle.create_parameter(
            shape=[1, self.dense_feature_dim, self.dense_emb_dim],
            dtype='float32',
            default_initializer=paddle.nn.initializer.TruncatedNormal(
                mean=0.0,
                std=self.init_value_ /
                math.sqrt(float(self.sparse_feature_dim))))

    def forward(self, sparse_inputs, dense_inputs):
        # -------------------- first order term --------------------
        """
        sparse_inputs: list, length: 26, list[tensor], each tensor shape: [2, 1]
        dense_inputs: Tensor(shape=[2, 13]), 2 --> train_batch_size
        """
        # Tensor(shape=[2, 26])
        sparse_inputs_concat = paddle.concat(sparse_inputs, axis=1)
        # Tensor(shape=[2, 26, 1])
        sparse_emb_one = self.embedding_one(sparse_inputs_concat)

        # dense_w_one: shape=[13], dense_inputs: shape=[2, 13]
        # dense_emb_one: shape=[2, 13]
        dense_emb_one = paddle.multiply(dense_inputs, self.dense_w_one)
        # shape=[2, 13, 1]
        dense_emb_one = paddle.unsqueeze(dense_emb_one, axis=2)

        # paddle.sum(sparse_emb_one, 1): shape=[2, 1]
        # paddle.sum(dense_emb_one, 1): shape=[2, 1]
        # y_first_order: shape=[2, 1]
        y_first_order = paddle.sum(sparse_emb_one, 1) + paddle.sum(
            dense_emb_one, 1)

        # -------------------- second order term --------------------
        # Tensor(shape=[2, 26, 9])
        sparse_embeddings = self.embedding(sparse_inputs_concat)
        # Tensor(shape=[2, 13, 1])
        dense_inputs_re = paddle.unsqueeze(dense_inputs, axis=2)
        # dense_inputs_re: Tensor(shape=[2, 13, 1])
        # dense_w: Tensor(shape=[1, 13, 9])
        # dense_embeddings: Tensor(shape=[2, 13, 9])
        dense_embeddings = paddle.multiply(dense_inputs_re, self.dense_w)
        # Tensor(shape=[2, 39, 9])
        feat_embeddings = paddle.concat([sparse_embeddings, dense_embeddings],
                                        1)

        # sum_square part
        # Tensor(shape=[2, 9])
        # \sum_{i=1}^n(v_{i,f}x_i) ---> for each embedding element e_i, sum over all features
        summed_features_emb = paddle.sum(feat_embeddings,
                                         1)  # None * embedding_size
        # Tensor(shape=[2, 9]) 2 --> batch_size
        summed_features_emb_square = paddle.square(
            summed_features_emb)  # None * embedding_size

        # square_sum part
        # Tensor(shape=[2, 39, 9])
        squared_features_emb = paddle.square(
            feat_embeddings)  # None * num_field * embedding_size
        # Tensor(shape=[2, 9]) 2 --> batch_size
        squared_sum_features_emb = paddle.sum(squared_features_emb,
                                              1)  # None * embedding_size

        # Tensor(shape=[2, 1])
        y_second_order = 0.5 * paddle.sum(
            summed_features_emb_square - squared_sum_features_emb,
            1,
            keepdim=True)  # None * 1

        return y_first_order, y_second_order, feat_embeddings
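A quick sanity check of the FM part, sketched with the hyperparameters from the comments above and the placeholder batch from section 2.1 (this is just an illustration, not part of the official training script):

fm = FM(sparse_feature_number=1000001, sparse_feature_dim=9,
        dense_feature_dim=13, sparse_num_field=26)
y_first_order, y_second_order, feat_embeddings = fm(sparse_inputs, dense_inputs)
print(y_first_order.shape)    # [2, 1]
print(y_second_order.shape)   # [2, 1]
print(feat_embeddings.shape)  # [2, 39, 9]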
2.3 DNN implementation
There is really nothing special here, so I will go through it quickly.
class DNN(paddle.nn.Layer):
    def __init__(self, sparse_feature_number, sparse_feature_dim,
                 dense_feature_dim, num_field, layer_sizes):
        super(DNN, self).__init__()
        self.sparse_feature_number = sparse_feature_number
        self.sparse_feature_dim = sparse_feature_dim
        self.dense_feature_dim = dense_feature_dim
        self.num_field = num_field
        self.layer_sizes = layer_sizes

        # [351, 512, 256, 128, 32, 1]
        sizes = [sparse_feature_dim * num_field] + self.layer_sizes + [1]
        acts = ["relu" for _ in range(len(self.layer_sizes))] + [None]
        self._mlp_layers = []
        for i in range(len(layer_sizes) + 1):
            linear = paddle.nn.Linear(
                in_features=sizes[i],
                out_features=sizes[i + 1],
                weight_attr=paddle.ParamAttr(
                    initializer=paddle.nn.initializer.Normal(
                        std=1.0 / math.sqrt(sizes[i]))))
            self.add_sublayer('linear_%d' % i, linear)
            self._mlp_layers.append(linear)
            if acts[i] == 'relu':
                act = paddle.nn.ReLU()
                self.add_sublayer('act_%d' % i, act)
                # keep the activation in the layer list so forward() applies it
                self._mlp_layers.append(act)

    def forward(self, feat_embeddings):
        """
        feat_embeddings: Tensor(shape=[2, 39, 9])
        """
        # Tensor(shape=[2, 351]) --> 351 = 39*9,
        # 39 is the number of fields (26 categorical + 13 continuous), 9 is the embedding size
        y_dnn = paddle.reshape(feat_embeddings,
                               [-1, self.num_field * self.sparse_feature_dim])
        for n_layer in self._mlp_layers:
            y_dnn = n_layer(y_dnn)
        return y_dnn
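Similarly, a quick sanity check of the DNN tower (a sketch: layer_sizes is taken from the [351, 512, 256, 128, 32, 1] comment above, num_field is 39 = 26 sparse + 13 dense, and feat_embeddings is the tensor returned by the FM part above):

dnn = DNN(sparse_feature_number=1000001, sparse_feature_dim=9,
          dense_feature_dim=13, num_field=39,
          layer_sizes=[512, 256, 128, 32])
y_dnn = dnn(feat_embeddings)
print(y_dnn.shape)  # [2, 1]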
2.4 Combining the FM and DNN parts
def forward(self, sparse_inputs, dense_inputs):
    # FM part: first-order term, second-order term, and the shared embeddings
    y_first_order, y_second_order, feat_embeddings = self.fm.forward(
        sparse_inputs, dense_inputs)
    # feat_embeddings: Tensor(shape=[2, 39, 9])
    # y_dnn: Tensor(shape=[2, 1])
    y_dnn = self.dnn.forward(feat_embeddings)
    # final prediction: sigmoid(y_FM + y_DNN)
    predict = F.sigmoid(y_first_order + y_second_order + y_dnn)
    return predict
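For completeness, the forward above is a method of a top-level DeepFM layer that owns the two sub-networks. Below is a minimal sketch of how such a wrapper could be assembled; it is my own reconstruction from the code above (hyperparameter values taken from the shape comments in sections 2.2/2.3), not necessarily line-for-line identical to the official implementation.

import paddle
import paddle.nn.functional as F


class DeepFM(paddle.nn.Layer):
    def __init__(self, sparse_feature_number, sparse_feature_dim,
                 dense_feature_dim, sparse_num_field, layer_sizes):
        super(DeepFM, self).__init__()
        # FM part; it also produces the embeddings shared with the DNN
        self.fm = FM(sparse_feature_number, sparse_feature_dim,
                     dense_feature_dim, sparse_num_field)
        # num_field = 26 sparse fields + 13 dense features = 39
        self.dnn = DNN(sparse_feature_number, sparse_feature_dim,
                       dense_feature_dim, sparse_num_field + dense_feature_dim,
                       layer_sizes)

    def forward(self, sparse_inputs, dense_inputs):
        # same logic as the forward shown above
        y_first_order, y_second_order, feat_embeddings = self.fm(sparse_inputs,
                                                                 dense_inputs)
        y_dnn = self.dnn(feat_embeddings)
        return F.sigmoid(y_first_order + y_second_order + y_dnn)


model = DeepFM(sparse_feature_number=1000001, sparse_feature_dim=9,
               dense_feature_dim=13, sparse_num_field=26,
               layer_sizes=[512, 256, 128, 32])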
3. Summary
Overall, DeepFM is a solid model and is widely used in industry. As I have said before: if your current production model is LR and you are migrating to deep learning, wide&deep is worth trying first to minimize migration cost; if you are coming from tree models such as XGBoost and want a deep model, I would go straight to DeepFM.
Also, DeepFM and DCN were both published in 2017, so neither paper contains a direct experimental comparison against the other. DCN V2, however, reports such a comparison, and on the datasets used in that paper the two models perform about the same; see DCN V2 for details. As always, run experiments in your own scenario to see which works better for you.