Previous article:
洪九(李戈): The Strategy Algorithm Engineer's Path: Common Deep Model Structures and Implementations (TensorFlow), Part 1
Table of Contents
6. Attention
6.1 Simple Attention Network
6.2 Multi-headed Attention
6.3 Self-Attention
7. Feature Cross Networks
7.1 FM Cross Network
7.2 Cross Network
7.3 CIN Cross Network
8. Regularization
8.1 L2 Regularization
8.2 L1 Regularization
8.3 Group Lasso
9. Multi-task Learning
9.1 Simple Multi-task Learning
9.2 Expert Networks
10. Handling Missing Features
11. Common Low-level Operations
6. Attention
The attention mechanism is a resource-allocation scheme: it devotes limited computational resources to the more important pieces of information, and is the main tool for dealing with information overload.
Attention works in two steps:
- Step 1: compute an attention distribution over all inputs.
Given a task-related query vector q and input vectors x_1, ..., x_T, a score s(x, q) is computed for each input. Common scoring functions are:
Additive model: s(x, q) = v^T tanh(W x + U q)
Dot-product model: s(x, q) = x^T q
Scaled dot-product model: s(x, q) = x^T q / sqrt(d)
Bilinear model: s(x, q) = x^T W q
- Step 2: compute the weighted average of the inputs under the attention distribution: att(X, q) = sum_t alpha_t x_t, where alpha = softmax(s).
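The two steps above can be sketched in a few lines of NumPy (a minimal illustration using the scaled dot-product score; all names and shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def simple_attention(q, X):
    """q: query vector [H]; X: input items [T, H]."""
    # Step 1: attention distribution over the T inputs (scaled dot-product score)
    scores = X @ q / np.sqrt(q.shape[-1])   # [T]
    alpha = softmax(scores)                 # [T], sums to 1
    # Step 2: weighted average of the inputs
    return alpha @ X, alpha                 # [H], [T]

rng = np.random.default_rng(0)
context, alpha = simple_attention(rng.normal(size=4), rng.normal(size=(5, 4)))
```

Because the weights come out of a softmax, the output is a convex combination of the inputs.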
6.1 Simple Attention Network
The figure above shows the most basic attention network, found in models such as AFM and DIN. Taking DIN as an example:
The Activation Unit is structured as follows:
Sample code:
def attention(queries, keys, keys_length):
    """
    queries:     [B, H]  B is the batch size, H the embedding dimension.
    keys:        [B, T, H]  T is the longest sequence in the batch; each row is one behavior item.
    keys_length: [B]  the actual sequence length of each sample.
    """
    # H: hidden size of each query vector
    queries_hidden_units = queries.get_shape().as_list()[-1]
    # tf.tile repeats the tensor: keep the batch dimension, repeat the query T times,
    # so queries becomes (B, H*T)
    queries = tf.tile(queries, [1, tf.shape(keys)[1]])
    # reshape to (B, T, H): within each sample, all T rows hold the same query
    queries = tf.reshape(queries, [-1, tf.shape(keys)[1], queries_hidden_units])
    # each of the four tensors below is (B, T, H); concatenating along the last axis
    # gives (B, T, H*4), pairing every behavior item with the target item
    din_all = tf.concat([queries, keys, queries - keys, queries * keys], axis=-1)
    # (B, T, 80)
    d_layer_1_all = tf.layers.dense(din_all, 80, activation=tf.nn.sigmoid, name='f1_att', reuse=tf.AUTO_REUSE)
    # (B, T, 40)
    d_layer_2_all = tf.layers.dense(d_layer_1_all, 40, activation=tf.nn.sigmoid, name='f2_att', reuse=tf.AUTO_REUSE)
    # (B, T, 1)
    d_layer_3_all = tf.layers.dense(d_layer_2_all, 1, activation=None, name='f3_att', reuse=tf.AUTO_REUSE)
    # reshape to (B, 1, T): one score per behavior item, matching the original sequence
    # layout, but each score now mixes a behavior item with the target item
    d_layer_3_all = tf.reshape(d_layer_3_all, [-1, 1, tf.shape(keys)[1]])
    outputs = d_layer_3_all
    # Mask: each row has T entries; if keys_length is e.g. [5, 6], the first 5 entries
    # of row 1 and the first 6 entries of row 2 are True
    key_masks = tf.sequence_mask(keys_length, tf.shape(keys)[1])  # [B, T]
    key_masks = tf.expand_dims(key_masks, 1)                      # [B, 1, T]
    # fill padded positions with a very large negative number (-2**32 + 1)
    # so they receive ~0 weight after the softmax
    paddings = tf.ones_like(outputs) * (-2 ** 32 + 1)
    outputs = tf.where(key_masks, outputs, paddings)              # [B, 1, T]
    # Scale by the square root of the embedding dimension
    outputs = outputs / (keys.get_shape().as_list()[-1] ** 0.5)
    # Activation
    outputs = tf.nn.softmax(outputs)                              # [B, 1, T]
    # weighted sum
    outputs = tf.matmul(outputs, keys)                            # [B, 1, H]
    return outputs
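The masking trick used above (overwriting padded positions with a large negative number before the softmax) can be verified in isolation. The following NumPy sketch mirrors `tf.sequence_mask` and the same `paddings` fill value (names are illustrative):

```python
import numpy as np

def masked_softmax(scores, lengths):
    """scores: [B, T]; lengths: [B] valid lengths. Padded positions get ~0 weight."""
    T = scores.shape[1]
    # boolean mask like tf.sequence_mask: True for valid positions
    mask = np.arange(T)[None, :] < np.asarray(lengths)[:, None]   # [B, T]
    paddings = np.full_like(scores, -2.0 ** 32 + 1)               # same fill value as above
    s = np.where(mask, scores, paddings)
    s = s - s.max(axis=1, keepdims=True)                          # numerical stability
    e = np.exp(s)                                                 # padded entries underflow to 0
    return e / e.sum(axis=1, keepdims=True)

w = masked_softmax(np.ones((2, 4)), lengths=[2, 3])
```

Each row sums to 1 and all weight mass falls on the valid prefix.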
The attention mechanism in Seq2Seq:
import tensorflow as tf

def attention(H):
    # H: [batch_size, time_step, hidden_size]
    H_shape = H.shape.as_list()
    time_step, hidden_size = H_shape[1], H_shape[2]
    # learned query vector: [hidden_size, 1]
    h_t = tf.Variable(tf.truncated_normal(shape=[hidden_size, 1], stddev=0.5, dtype=tf.float32))
    # W: [hidden_size, hidden_size]
    W = tf.Variable(tf.truncated_normal(shape=[hidden_size, hidden_size],
                                        stddev=0.5, dtype=tf.float32))
    # score: [batch_size*time_step, 1]
    score = tf.matmul(tf.matmul(tf.reshape(H, [-1, hidden_size]), W), h_t)
    # score: [batch_size, time_step, 1]
    score = tf.reshape(score, [-1, time_step, 1])
    # alpha: [batch_size, time_step, 1]; note the softmax must run over the time axis,
    # not the default last axis (which here has size 1)
    alpha = tf.nn.softmax(score, axis=1)
    # context: [batch_size, hidden_size, 1]
    c_t = tf.matmul(tf.transpose(H, [0, 2, 1]), alpha)
    return tf.tanh(c_t)
6.2 Multi-headed Attention
Scaled dot-product attention is the familiar attention variant that measures similarity with a dot product.
The structure of multi-head attention is shown in the figure above.
Reference code:
def scaled_dot_product_attention(query, key, value, mask):
    matmul_qk = tf.matmul(query, key, transpose_b=True)
    depth = tf.cast(tf.shape(key)[-1], tf.float32)
    logits = matmul_qk / tf.math.sqrt(depth)
    # add the mask to zero out padding tokens
    if mask is not None:
        logits += (mask * -1e9)
    attention_weights = tf.nn.softmax(logits, axis=-1)
    return tf.matmul(attention_weights, value)
The multi-head part:
# multi-head attention
def multihead_attention(queries, keys, values,
                        num_heads=8,
                        dropout_rate=0,
                        training=True,
                        causality=False,
                        scope="multihead_attention"):
    '''
    It is beneficial to linearly project the queries, keys and values h times with
    different learned projections to dk, dk and dv dimensions. Attention is then run
    on each projected version in parallel, yielding dv-dimensional outputs, which are
    concatenated and projected once more to produce the final values.
    :param queries: 3-D tensor [N, T_q, d_model]
    :param keys:    3-D tensor [N, T_k, d_model]
    :param values:  3-D tensor [N, T_k, d_model]
    :param num_heads: number of heads
    :param dropout_rate:
    :param training: controls the dropout mechanism
    :param causality: whether to apply a causal (look-ahead) mask
    :param scope:
    :return: 3-D tensor (N, T_q, d_model)
    '''
    d_model = queries.get_shape().as_list()[-1]
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        # Linear projections
        Q = tf.layers.dense(queries, d_model, use_bias=False)  # (N, T_q, d_model)
        K = tf.layers.dense(keys, d_model, use_bias=False)     # (N, T_k, d_model)
        V = tf.layers.dense(values, d_model, use_bias=False)   # (N, T_k, d_model)
        # Split and concat
        Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0)  # (h*N, T_q, d_model/h)
        K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0)  # (h*N, T_k, d_model/h)
        V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0)  # (h*N, T_k, d_model/h)
        # Attention (the helper above takes (query, key, value, mask);
        # causality masking and dropout are omitted here for brevity)
        outputs = scaled_dot_product_attention(Q_, K_, V_, mask=None)
        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2)  # (N, T_q, d_model)
        # Residual connection
        outputs += queries
        # Normalize (ln is a layer-normalization helper, not shown here)
        outputs = ln(outputs)
    return outputs
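The `split`/`concat` pair is the heart of the head bookkeeping: it turns (N, T_q, d_model) into (h*N, T_q, d_model/h) so that all heads run through one batched attention call, and the reverse pair restores the original layout exactly. A NumPy shape check (illustrative sizes):

```python
import numpy as np

N, T, d_model, h = 2, 5, 8, 4
Q = np.arange(N * T * d_model, dtype=float).reshape(N, T, d_model)

# split the feature dimension into h heads, then stack the heads into the batch dimension
Q_ = np.concatenate(np.split(Q, h, axis=2), axis=0)           # (h*N, T, d_model/h)

# restore: split the batch dimension back and re-concat along the features
Q_restored = np.concatenate(np.split(Q_, h, axis=0), axis=2)  # (N, T, d_model)
```

The round trip is lossless, which is why the residual connection after it is valid.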
6.3 Self-Attention
In self-attention, the queries, keys and values are all (projections of) the same input sequence, so every position attends to every other position of the input.
7. Feature Cross Networks
Explicit feature crosses complement the implicit crosses learned by deep models and are widely used in practice. Common feature-cross structures include FM, the Cross network, and CIN.
7.1 FM Cross Network
The FM cross network implements second-order interactions between features; DeepFM is a typical example.
The interaction term is computed as:
y_FM = sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_k [ (sum_i v_{ik} x_i)^2 - sum_i (v_{ik} x_i)^2 ]
Sample code:
def second_order_part(self, sparse_id, sparse_value):
    with tf.variable_scope("second-order"):
        V = tf.get_variable("weight", (self.feature_size, self.factor_size),
                            initializer=tf.random_normal_initializer(0.0, 0.01))
        self.embeddings = tf.nn.embedding_lookup(V, sparse_id)
        # None * F * K (sparse_value is expected to broadcast as None * F * 1)
        self.embeddings = tf.multiply(self.embeddings, sparse_value)
        # square of sum: None * K
        sum_squared_part = tf.square(tf.reduce_sum(self.embeddings, 1))
        # sum of squares: None * K
        squared_sum_part = tf.reduce_sum(tf.square(self.embeddings), 1)
        y_second_order = 0.5 * tf.subtract(sum_squared_part, squared_sum_part)
        return y_second_order
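The square-of-sum minus sum-of-squares trick above computes all pairwise interactions in O(F*K) instead of O(F^2*K). A NumPy check against the explicit pairwise sum (illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(6, 4))   # F=6 feature embeddings of dimension K=4

# FM trick: 0.5 * ((sum_i v_i)^2 - sum_i v_i^2), elementwise over the K dimensions
fast = 0.5 * (E.sum(axis=0) ** 2 - (E ** 2).sum(axis=0))

# explicit second-order interactions: sum over i<j of v_i * v_j
slow = sum(E[i] * E[j] for i in range(6) for j in range(i + 1, 6))
```

Both routes give the same K-dimensional interaction vector.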
7.2 Cross Network
The Cross network can realize feature interactions of arbitrary order (the layer depth is chosen per application); it is typically used in the DCN model. Reference code:
def cross_layer(x0, x, name):
    with tf.variable_scope(name):
        input_dim = x0.get_shape().as_list()[1]
        w = tf.get_variable("weight", [input_dim],
                            initializer=tf.truncated_normal_initializer(stddev=0.01))
        b = tf.get_variable("bias", [input_dim],
                            initializer=tf.truncated_normal_initializer(stddev=0.01))
        xb = tf.tensordot(tf.reshape(x, [-1, 1, input_dim]), w, 1)
        return x0 * xb + b + x
There is a trick here: reordering the computation (using the associativity of matrix multiplication) greatly improves performance, as shown below:
The special structure of the Cross network makes the order of the cross features grow with layer depth, while the number of parameters grows only linearly with the input dimension.
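The trick can be verified directly: by associativity, (x0 x^T) w equals x0 (x^T w), so the d x d outer product never needs to be materialized. A NumPy sketch (illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x0, x, w, b = (rng.normal(size=d) for _ in range(4))

# naive order: materialize the d x d outer product first, O(d^2) memory
slow = (np.outer(x0, x) @ w) + b + x

# reordered (associativity): compute the scalar x^T w first, O(d) work
fast = x0 * (x @ w) + b + x
```

Both orders compute the same cross-layer output; only the cost differs.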
References:
杨旭东: 玩转企业级Deep&Cross Network模型你只差一步 (zhuanlan.zhihu.com)
7.3 CIN Cross Network
CIN (Compressed Interaction Network) improves on the Cross network: it performs explicit high-order feature interactions at the vector-wise level, further strengthening interaction capacity. The computation proceeds as follows:
- Outer product
First, an intermediate tensor Z^{k+1} is built from the outer products (taken per embedding dimension) of the previous layer X^k and the base feature matrix X^0.
- Feature compression
The tensor computed above is then compressed with learned filters (implemented as a 1-D convolution) to produce the next layer X^{k+1}.
- Feature concatenation
The resulting layers are sum-pooled along the embedding dimension and concatenated to form the CIN output.
Reference code:
# embedding dimension
D = Config.embedding_size
final_result = []
final_len = 0
# raw DNN input features
nn_input = tf.reshape(dnn_input,
                      shape=[-1, self.field_size, Config.embedding_size])
# cache the CIN layers
cin_layers = [nn_input]
field_nums = [self.field_size]
# split the base input into D slices along the last dimension
split_tensor_0 = tf.split(nn_input, D * [1], 2)
# build the multi-layer CIN network in a loop
for idx, layer_size in enumerate(Config.cross_layer_size):
    # split the latest layer into D slices along the last dimension
    now_tensor = tf.split(cin_layers[-1], D * [1], 2)
    # outer product, H_k x m per embedding dimension
    dot_result_m = tf.matmul(split_tensor_0, now_tensor, transpose_b=True)
    # build Z with shape (batch, D, H_k * m)
    dot_result_o = tf.reshape(dot_result_m, shape=[D, -1, field_nums[0] * field_nums[-1]])
    dot_result = tf.transpose(dot_result_o, perm=[1, 0, 2])
    # feature compression (build filters -> 1-D convolution -> activation -> transpose)
    filters = tf.get_variable(name="f_" + str(idx),
                              shape=[1, field_nums[-1] * field_nums[0], layer_size],
                              dtype=tf.float32)
    curr_out = tf.nn.conv1d(dot_result, filters=filters, stride=1, padding='VALID')
    b = tf.get_variable(name="f_b" + str(idx), shape=[layer_size], dtype=tf.float32,
                        initializer=tf.zeros_initializer())
    curr_out = tf.nn.relu(tf.nn.bias_add(curr_out, b))
    curr_out = tf.transpose(curr_out, perm=[0, 2, 1])
    if Config.cross_direct:
        direct_connect = curr_out
        next_hidden = curr_out
        final_len += layer_size
        field_nums.append(int(layer_size))
    else:
        if idx != len(Config.cross_layer_size) - 1:
            next_hidden, direct_connect = tf.split(curr_out, 2 * [int(layer_size / 2)], 1)
            final_len += int(layer_size / 2)
        else:
            direct_connect = curr_out
            next_hidden = 0
            final_len += layer_size
        field_nums.append(int(layer_size / 2))
    # save the latest layer
    final_result.append(direct_connect)
    cin_layers.append(next_hidden)
# sum pooling over the embedding dimension, then concatenation
result = tf.concat(final_result, axis=1)
result = tf.reduce_sum(result, -1)
# output projection
w_nn_output1 = tf.get_variable(name='w_nn_output1', shape=[final_len, Config.cross_output_size],
                               dtype=tf.float32)
b_nn_output1 = tf.get_variable(name='b_nn_output1', shape=[Config.cross_output_size], dtype=tf.float32,
                               initializer=tf.zeros_initializer())
CIN_out = tf.nn.xw_plus_b(result, w_nn_output1, b_nn_output1)
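The per-layer computation (outer product followed by compression) can be written compactly with einsum. The following NumPy sketch is a minimal single-layer illustration, not the batched conv1d formulation above (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, H, D, H_next = 3, 2, 4, 5          # fields, previous layer size, embedding dim, next layer size
X0 = rng.normal(size=(m, D))          # base feature embeddings
Xk = rng.normal(size=(H, D))          # previous CIN layer
W = rng.normal(size=(H_next, H, m))   # one compression filter per output row

# outer product per embedding dimension: Z[h, j, d] = Xk[h, d] * X0[j, d]
Z = np.einsum('hd,md->hmd', Xk, X0)   # (H, m, D)

# compression: each output row is a weighted sum over the (H, m) interaction grid
X_next = np.einsum('nhm,hmd->nd', W, Z)   # (H_next, D)
```

Each row of X_next is a vector-wise (per embedding dimension) combination of all pairwise interactions, which is exactly the "compressed interaction" the name refers to.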
8. Regularization
Regularization penalizes feature weights: the weights themselves become part of the model's loss. Intuitively, using a feature now costs some loss, so unless the feature is genuinely effective, its benefit is outweighed by the loss it adds. This filters for the most effective features and keeps weights small to prevent overfitting. In general, L1 regularization produces sparse weights, driving the weights of most useless features to exactly zero, while L2 regularization keeps weights from growing too large, spreading them more evenly.
8.1 L2 Regularization
First, when declaring weight variables, add the regularization loss to a dedicated collection:
def get_weights(shape, weight_decay=0.0, dtype=tf.float32, trainable=True):
    """
    Add weight regularization to a loss collection.
    Args:
        shape:
        weight_decay:
        dtype:
        trainable:
    Returns:
    """
    weight = tf.Variable(initial_value=tf.truncated_normal(shape=shape, stddev=0.01),
                         name='Weights', dtype=dtype, trainable=trainable)
    if weight_decay > 0:
        # step 1: compute the regularization loss
        weight_loss = tf.nn.l2_loss(weight) * weight_decay
        # step 2: add it to a collection (a built-in TensorFlow collection or a custom one)
        tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, value=weight_loss)
    return weight
Then, when computing the total loss, add the data loss and the accumulated regularization losses:
with tf.variable_scope("loss"):
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=input_label_placeholder,
                                                            name='entropy')
    loss_op = tf.reduce_mean(input_tensor=cross_entropy, name='loss')
    weight_loss_op = tf.losses.get_regularization_losses()
    weight_loss_op = tf.add_n(weight_loss_op)
    total_loss_op = loss_op + weight_loss_op
Finally, hand the total loss to the optimizer (the train op should be built before the session runs the graph):
# optimizer
train_op = tf.train.GradientDescentOptimizer(learning_rate=LEARNING_RATE).minimize(loss=total_loss_op,
                                                                                   global_step=global_step)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
    sess.run(init_op)
    input_data, input_label = sess.run([data_batch, label_batch])
    feed_dict = {input_data_placeholder: input_data, input_label_placeholder: input_label}
    _, total_loss, loss, weight_loss = sess.run([train_op, total_loss_op, loss_op, weight_loss_op],
                                                feed_dict=feed_dict)
References:
我继续: tensorflow损失函数加上正则化
8.2 L1 Regularization
The code is the same as above, with the L1 penalty (the sum of absolute weight values) in place of tf.nn.l2_loss.
As the figure shows, L2 regularization effectively rescales the weights, while L1 performs soft thresholding: it sets many weights to exactly zero, which is why it yields a sparse result.
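The soft-thresholding view of L1 can be sketched directly; the following is the proximal operator of the L1 penalty (illustrative values):

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of the L1 penalty: shrink toward zero, zeroing small weights."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.05, -0.3, 1.2, -0.01])
sparse_w = soft_threshold(w, lam=0.1)
```

Weights whose magnitude is below the threshold are set to exactly zero; the rest are shrunk by the threshold, which is the sparsity pattern described above.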
8.3 Group Lasso
Yuan generalized the lasso to groups in 2006, giving rise to the group lasso: partition the variables into groups and penalize the L2 norm of each group in the objective. The effect is that an entire group of coefficients can be zeroed out at once, i.e. a whole group of variables is removed. Group lasso is thus a generalization of the lasso to grouped features; if every group contains exactly one feature, it reduces to the ordinary lasso. Its objective adds to the loss the penalty
lambda * sum_g sqrt(p_g) * ||w_g||_2
where p_g is the size of group g and w_g its weight sub-vector.
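The group-lasso penalty itself is easy to state outside TensorFlow; a NumPy sketch of sum_g sqrt(p_g) * ||w_g||_2 (illustrative weights and grouping):

```python
import numpy as np

def group_lasso_penalty(w, groups):
    """Sum over groups of sqrt(p_g) * ||w_g||_2, where p_g is the group size."""
    penalty = 0.0
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        penalty += np.sqrt(len(idx)) * np.linalg.norm(w[idx])
    return penalty

w = np.array([3.0, 4.0, 0.0, 0.0])
p = group_lasso_penalty(w, groups=[0, 0, 1, 1])
```

The all-zero group contributes nothing, while the nonzero group is penalized by its Euclidean norm scaled by sqrt of its size.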
import math
import re
import tensorflow as tf

def group_lasso(alpha, scale, groups):
    all_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
    all_weights = [x for x in all_variables if re.search("weights", x.name)]
    weights1 = [x for x in all_weights if re.search("hidden1/weights", x.name)][0]
    weights_others = [x for x in all_weights if not re.search("hidden1/weights", x.name)]
    # group lasso regularization for the input-hidden1 weights
    regularizer = tf.contrib.layers.l2_regularizer(scale=scale)
    rg = 0.0
    for group_id in list(set(groups)):
        this_group_mask = [i for i, x in enumerate(groups) if x == group_id]
        pl = len(this_group_mask)
        rg += math.sqrt(pl) * tf.sqrt(regularizer(tf.gather(weights1, tf.to_int64(this_group_mask))))
    regularizer2 = tf.contrib.layers.l1_regularizer(scale=scale)
    if alpha != 1:
        # blend in an L1 penalty on the remaining weights
        rg = rg * alpha + (1 - alpha) * tf.contrib.layers.apply_regularization(regularizer2, weights_others)
    return rg

def sparse_group_lasso(alpha, scale, groups):
    all_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
    all_weights = [x for x in all_variables if re.search("weights", x.name)]
    weights1 = [x for x in all_weights if re.search("hidden1/weights", x.name)][0]
    # group lasso regularization for the input-hidden1 weights
    regularizer = tf.contrib.layers.l2_regularizer(scale=scale)
    rg = 0.0
    for group_id in list(set(groups)):
        this_group_mask = [i for i, x in enumerate(groups) if x == group_id]
        pl = len(this_group_mask)
        rg += math.sqrt(pl) * tf.sqrt(regularizer(tf.gather(weights1, tf.to_int64(this_group_mask))))
    # sparse group lasso additionally applies an L1 penalty to all weights
    regularizer2 = tf.contrib.layers.l1_regularizer(scale=scale)
    rg = alpha * rg + (1 - alpha) * tf.contrib.layers.apply_regularization(regularizer2, all_weights)
    return rg
9. Multi-task Learning
A key characteristic of multi-task learning is that a single input drives the outputs of several related tasks, which share part of the network's representation.
9.1 Simple Multi-task Learning
import numpy as np
import tensorflow as tf

# placeholders
X = tf.placeholder("float", [10, 10], name="X")
Y1 = tf.placeholder("float", [10, 20], name="Y1")
Y2 = tf.placeholder("float", [10, 20], name="Y2")
# weights
initial_shared_layer_weights = np.random.rand(10, 20)
initial_Y1_layer_weights = np.random.rand(20, 20)
initial_Y2_layer_weights = np.random.rand(20, 20)
shared_layer_weights = tf.Variable(initial_shared_layer_weights, name="share_W", dtype="float32")
Y1_layer_weights = tf.Variable(initial_Y1_layer_weights, name="share_Y1", dtype="float32")
Y2_layer_weights = tf.Variable(initial_Y2_layer_weights, name="share_Y2", dtype="float32")
# build the layers with ReLU activations
shared_layer = tf.nn.relu(tf.matmul(X, shared_layer_weights))
Y1_layer = tf.nn.relu(tf.matmul(shared_layer, Y1_layer_weights))
Y2_layer = tf.nn.relu(tf.matmul(shared_layer, Y2_layer_weights))
# per-task losses
Y1_Loss = tf.nn.l2_loss(Y1 - Y1_layer)
Y2_Loss = tf.nn.l2_loss(Y2 - Y2_layer)
Alternating training:
# optimizers
Y1_op = tf.train.AdamOptimizer().minimize(Y1_Loss)
Y2_op = tf.train.AdamOptimizer().minimize(Y2_Loss)
with tf.Session() as session:
    session.run(tf.initialize_all_variables())
    for iters in range(10):
        if np.random.rand() < 0.5:
            _, Y1_loss = session.run([Y1_op, Y1_Loss],
                                     {X: np.random.rand(10, 10) * 10,
                                      Y1: np.random.rand(10, 20) * 10,
                                      Y2: np.random.rand(10, 20) * 10})
            print(Y1_loss)
        else:
            _, Y2_loss = session.run([Y2_op, Y2_Loss],
                                     {X: np.random.rand(10, 10) * 10,
                                      Y1: np.random.rand(10, 20) * 10,
                                      Y2: np.random.rand(10, 20) * 10})
            print(Y2_loss)
Joint training:
# per-task losses
Y1_Loss = tf.nn.l2_loss(Y1 - Y1_layer)
Y2_Loss = tf.nn.l2_loss(Y2 - Y2_layer)
# sum the two losses (the key step)
Joint_Loss = Y1_Loss + Y2_Loss
# optimizer
Optimiser = tf.train.AdamOptimizer().minimize(Joint_Loss)
# joint training
with tf.Session() as session:
    session.run(tf.initialize_all_variables())
    _, joint_loss = session.run([Optimiser, Joint_Loss],
                                {X: np.random.rand(10, 10) * 10,
                                 Y1: np.random.rand(10, 20) * 10,
                                 Y2: np.random.rand(10, 20) * 10})
    print(joint_loss)
Reference code:
https://github.com/jg8610/multi-task-part-1-notebook/blob/master/Multi-Task%20Learning%20Tensorflow%20Part%201.ipynb
https://github.com/jg8610/multi-task-learning/blob/master/graph.py
9.2 Expert Networks
MMoE (Multi-gate Mixture-of-Experts). Different tasks (CTR, CVR, etc.) call for different weightings of the model, so the authors equip each task with its own gating network that produces a task-specific weighting over a set of shared expert networks.
Shared bottom:
def model_fn(features, labels, mode, params):
    tf.set_random_seed(2019)
    cont_feats = features["cont_feats"]
    cate_feats = features["cate_feats"]
    vector_feats = features["vector_feats"]
    single_cate_feats = cate_feats[:, 0:params.cate_field_size]
    multi_cate_feats = cate_feats[:, params.cate_field_size:]
    cont_feats_index = tf.Variable([[i for i in range(params.cont_field_size)]], trainable=False,
                                   dtype=tf.int64, name="cont_feats_index")
    cont_index_add = tf.add(cont_feats_index, params.cate_feats_size)
    index_max_size = params.cont_field_size + params.cate_feats_size
    feats_emb = my_layer.emb_init(name='feats_emb', feat_num=index_max_size,
                                  embedding_size=params.embedding_size)
    # continuous features -> embedding
    with tf.name_scope("cont_feat_emb"):
        ori_cont_emb = tf.nn.embedding_lookup(feats_emb, ids=cont_index_add, name="ori_cont_emb")
        cont_value = tf.reshape(cont_feats, shape=[-1, params.cont_field_size, 1], name="cont_value")
        cont_emb = tf.multiply(ori_cont_emb, cont_value)
        cont_emb = tf.reshape(cont_emb, shape=[-1, params.cont_field_size * params.embedding_size],
                              name="cont_emb")
    # single-valued categorical features -> embedding
    with tf.name_scope("single_cate_emb"):
        cate_emb = tf.nn.embedding_lookup(feats_emb, ids=single_cate_feats)
        cate_emb = tf.reshape(cate_emb, shape=[-1, params.cate_field_size * params.embedding_size])
    # multi-valued categorical features -> embedding
    with tf.name_scope("multi_cate_emb"):
        multi_cate_emb = my_layer.multi_cate_emb(params.multi_feats_range, feats_emb, multi_cate_feats)
    # deep input dense
    dense_input = tf.concat([cont_emb, vector_feats, cate_emb, multi_cate_emb], axis=1, name='dense_vector')
Expert networks:
def model_fn(features, labels, mode, params):
    ...  # shared-bottom part above
    # deep input dense
    dense_input = tf.concat([cont_emb, vector_feats, cate_emb, multi_cate_emb],
                            axis=1, name='dense_vector')
    # experts
    experts_weight = tf.get_variable(name='experts_weight',
                                     dtype=tf.float32,
                                     shape=(dense_input.get_shape()[1], params.experts_units, params.experts_num),
                                     initializer=tf.contrib.layers.xavier_initializer())
    experts_bias = tf.get_variable(name='expert_bias',
                                   dtype=tf.float32,
                                   shape=(params.experts_units, params.experts_num),
                                   initializer=tf.contrib.layers.xavier_initializer())
    # f_{i}(x) = activation(W_{i} * x + b)
    experts_output = tf.tensordot(dense_input, experts_weight, axes=1)
    use_experts_bias = True
    if use_experts_bias:
        experts_output = tf.add(experts_output, experts_bias)
    experts_output = tf.nn.relu(experts_output)
Gating networks:
def model_fn(features, labels, mode, params):
    ...
    # gates
    gate1_weight = tf.get_variable(name='gate1_weight',
                                   dtype=tf.float32,
                                   shape=(dense_input.get_shape()[1], params.experts_num),
                                   initializer=tf.contrib.layers.xavier_initializer())
    gate1_bias = tf.get_variable(name='gate1_bias',
                                 dtype=tf.float32,
                                 shape=(params.experts_num,),
                                 initializer=tf.contrib.layers.xavier_initializer())
    gate2_weight = tf.get_variable(name='gate2_weight',
                                   dtype=tf.float32,
                                   shape=(dense_input.get_shape()[1], params.experts_num),
                                   initializer=tf.contrib.layers.xavier_initializer())
    gate2_bias = tf.get_variable(name='gate2_bias',
                                 dtype=tf.float32,
                                 shape=(params.experts_num,),
                                 initializer=tf.contrib.layers.xavier_initializer())
    # g^{k}(x) = activation(W_{gk} * x + b), where activation is softmax according to the paper
    gate1_output = tf.matmul(dense_input, gate1_weight)
    gate2_output = tf.matmul(dense_input, gate2_weight)
    use_gate_bias = True
    if use_gate_bias:
        gate1_output = tf.add(gate1_output, gate1_bias)
        gate2_output = tf.add(gate2_output, gate2_bias)
    gate1_output = tf.nn.softmax(gate1_output)
    gate2_output = tf.nn.softmax(gate2_output)
    ...
Multi-task fusion:
def model_fn(features, labels, mode, params):
    ...
    # f^{k}(x) = sum_{i=1}^{n}(g^{k}(x)_{i} * f_{i}(x))
    label1_input = tf.multiply(experts_output, tf.expand_dims(gate1_output, axis=1))
    label1_input = tf.reduce_sum(label1_input, axis=2)
    label1_input = tf.reshape(label1_input, [-1, params.experts_units])
    label2_input = tf.multiply(experts_output, tf.expand_dims(gate2_output, axis=1))
    label2_input = tf.reduce_sum(label2_input, axis=2)
    label2_input = tf.reshape(label2_input, [-1, params.experts_units])
    len_layers = len(params.hidden_units)
    with tf.variable_scope('ctr_deep'):
        dense_ctr = tf.layers.dense(inputs=label1_input, units=params.hidden_units[0], activation=tf.nn.relu)
        for i in range(1, len_layers):
            dense_ctr = tf.layers.dense(inputs=dense_ctr, units=params.hidden_units[i], activation=tf.nn.relu)
        ctr_out = tf.layers.dense(inputs=dense_ctr, units=1)
    with tf.variable_scope('cvr_deep'):
        dense_cvr = tf.layers.dense(inputs=label2_input, units=params.hidden_units[0], activation=tf.nn.relu)
        for i in range(1, len_layers):
            dense_cvr = tf.layers.dense(inputs=dense_cvr, units=params.hidden_units[i], activation=tf.nn.relu)
        cvr_out = tf.layers.dense(inputs=dense_cvr, units=1)
    ctr_score = tf.identity(tf.nn.sigmoid(ctr_out), name='ctr_score')
    cvr_score = tf.identity(tf.nn.sigmoid(cvr_out), name='cvr_score')
    # ESMM-style: pCTCVR = pCTR * pCVR
    ctcvr_score = ctr_score * cvr_score
    ctcvr_score = tf.identity(ctcvr_score, name='ctcvr_score')
    score = tf.add(ctr_score * params.label1_weight, cvr_score * params.label2_weight)
    score = tf.identity(score, name='score')
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=score)
    else:
        ctr_labels = tf.identity(labels['label'], name='ctr_labels')
        ctcvr_labels = tf.identity(labels['label2'], name='ctcvr_labels')
        ctr_auc = tf.metrics.auc(labels=ctr_labels, predictions=ctr_score, name='auc')
        ctcvr_auc = tf.metrics.auc(labels=ctcvr_labels, predictions=ctcvr_score, name='auc')
        metrics = {
            'ctr_auc': ctr_auc,
            'ctcvr_auc': ctcvr_auc
        }
        # ctr_loss = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=ctr_labels, logits=ctr_out))
        ctr_loss = tf.reduce_mean(tf.losses.log_loss(labels=ctr_labels, predictions=ctr_score))
        ctcvr_loss = tf.reduce_mean(tf.losses.log_loss(labels=ctcvr_labels, predictions=ctcvr_score))
        loss = ctr_loss + ctcvr_loss
        if mode == tf.estimator.ModeKeys.TRAIN:
            optimizer = tf.train.AdamOptimizer(params.learning_rate)
            train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
        else:
            train_op = None
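The gate-weighted mixture at the core of MMoE, f^{k}(x) = sum_i g^{k}(x)_i * f_i(x), can be checked in NumPy with the same tensor layout as above: experts_output of shape (batch, units, experts) and a per-task softmax gate (illustrative sizes):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
B, U, n = 2, 4, 3                               # batch, expert units, number of experts
experts_output = rng.normal(size=(B, U, n))     # f_i(x), same layout as the code above
gate_output = softmax(rng.normal(size=(B, n)))  # g^k(x), one softmax per sample

# f^k(x): weight each expert by its gate value and sum over the expert axis
task_input = (experts_output * gate_output[:, None, :]).sum(axis=2)   # (B, U)
```

Each task gets its own gate, so the same experts are mixed differently per task.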
10. Handling Missing Features
In TensorFlow, tf.cond() acts like if...else in C: it controls which branch of the dataflow graph executes, which makes it useful for substituting a default value when a feature is missing.
tf.cond(
    pred,
    true_fn=None,
    false_fn=None,
    strict=False,
    name=None,
    fn1=None,
    fn2=None
)
a = tf.constant(1)
b = tf.constant(2)
p = tf.constant(True)
x = tf.cond(p, lambda: a + b, lambda: a * b)
print(tf.Session().run(x))
# Output: 3
As shown above, only the first three parameters are commonly used, so the call simplifies to tf.cond(pred, fn1, fn2), analogous to the ternary "? :" operator in Java.
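For the missing-feature use case that titles this section, the same select-a-branch pattern can be sketched elementwise in NumPy with np.where (the -1.0 missing marker and the mean-fill policy here are assumptions for illustration):

```python
import numpy as np

feats = np.array([0.7, -1.0, 0.3, -1.0])    # -1.0 marks a missing value (assumed convention)
default = feats[feats != -1.0].mean()       # fill policy: mean of the observed values

# elementwise analogue of tf.cond's branch selection
filled = np.where(feats == -1.0, default, feats)
```

Observed values pass through untouched; missing slots receive the default.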
11. Common Low-level Operations
11.1 Extracting Features from a Specific Layer
Deep models have strong representational power. Sometimes we train a classification model not to classify, but to extract features for other tasks, such as computing image similarity.
# 1) name the layer to be extracted when building the model
...
h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total], name='h_pool_flat')
...
# 2) fetch the h_pool_flat features via sess.run()
feature = graph.get_operation_by_name("h_pool_flat").outputs[0]
batch_predictions, batch_feature = sess.run([predictions, feature],
                                            {input_x: x_test_batch, dropout_keep_prob: 1.0})
11.2 Different Optimizers for Different Model Components
def _train_op_fn(loss):
    """Returns the op to optimize the loss."""
    train_ops = []
    global_step = tf.train.get_global_step()
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        # note: global_step is not passed to the component optimizers,
        # otherwise each minimize() would increment it again
        if dnn_logits is not None:
            train_ops.append(
                dnn_optimizer.minimize(
                    loss,
                    var_list=tf.get_collection(
                        tf.GraphKeys.TRAINABLE_VARIABLES,
                        scope=dnn_parent_scope)))
        if linear_logits is not None:
            train_ops.append(
                linear_optimizer.minimize(
                    loss,
                    var_list=tf.get_collection(
                        tf.GraphKeys.TRAINABLE_VARIABLES,
                        scope=linear_parent_scope)))
        if cnn_logits is not None:
            train_ops.append(
                cnn_optimizer.minimize(
                    loss,
                    var_list=tf.get_collection(
                        tf.GraphKeys.TRAINABLE_VARIABLES,
                        scope=cnn_parent_scope)))
        # combine the per-component train ops
        train_op = tf.group(*train_ops)
        with tf.control_dependencies([train_op]):
            # increment the global step exactly once
            with tf.colocate_with(global_step):
                return tf.assign_add(global_step, 1)
tf.group() combines multiple operations: ops = tf.group(tensor1, tensor2, ...) takes zero or more tensors/ops, and once ops has run, every op passed in has run as well. It is commonly used to combine training nodes.