The Road of a Strategy Algorithm Engineer: Common Deep Model Structures and Their Implementations (TensorFlow), Part 2

This article covers attention mechanisms in deep models, including the simple attention network, multi-head attention and self-attention, as well as feature-cross networks such as FM, Cross and CIN. It also discusses L1, L2 and group-lasso regularization, a simple implementation of multi-task learning, the MMoE expert network, and how to handle missing features in TensorFlow.

Previous installment:

洪九(李戈): The Road of a Strategy Algorithm Engineer: Common Deep Model Structures and Their Implementations (TensorFlow), Part 1

Contents

6. Attention Mechanism

6.1 Simple Attention Network

6.2 Multi-Head Attention

6.3 Self-Attention

7. Feature Cross Networks

7.1 FM Cross Network

7.2 Cross Network

7.3 CIN Cross Network

8. Regularization

8.1 L2 Regularization

8.2 L1 Regularization

8.3 Group Lasso

9. Multi-Task Learning

9.1 Simple Multi-Task Learning

9.2 Expert Networks

10. Handling Missing Features

11. Common Low-Level Operations

6. Attention Mechanism


The attention mechanism is a resource-allocation scheme: it devotes limited computation to the most important information, and is the main tool for coping with information overload.

Attention proceeds in two steps:

  • Step 1: compute an attention distribution over all of the input information.

Given a task-specific query vector q, compute the relevance s(x_i, q) between q and each of the N input vectors x_i; the normalized scores α_i = softmax_i(s(x_i, q)) are called the attention distribution. The scoring function s can take several forms, for example:

Additive model: s(x, q) = vᵀ tanh(W x + U q)

Dot-product model: s(x, q) = xᵀ q

Scaled dot-product model: s(x, q) = xᵀ q / √d

Bilinear model: s(x, q) = xᵀ W q

  • Step 2: compute the weighted average of the inputs under the attention distribution.
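The two steps above can be sketched in a few lines of numpy (an illustrative sketch, not from the original post; Step 1 uses the scaled dot-product score):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def simple_attention(q, X):
    """Step 1: score the N input vectors against the query q with the scaled
    dot-product model, then softmax into an attention distribution.
    Step 2: return the weighted average of the inputs.
    q: (d,), X: (N, d)."""
    d = q.shape[0]
    scores = X @ q / np.sqrt(d)   # relevance of each input to the query
    alpha = softmax(scores)       # attention distribution
    return alpha @ X              # weighted average of the inputs

context = simple_attention(np.ones(4), np.arange(12.0).reshape(3, 4))
```

Swapping the score line for `X @ W @ q` gives the bilinear model; the additive model replaces it with a small feed-forward layer.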

6.1 Simple Attention Network

[Figure: basic attention network]

The figure above shows the most basic attention network, found in models such as AFM and DIN. Take DIN as an example:

[Figure: Deep Interest Network]

The Activation Unit has the following structure:

[Figure: Activation Unit structure]

Example code:

def attention(queries, keys, keys_length):
    """
      queries:     [B, H]    B is the batch size, H is the embedding dimension.
      keys:        [B, T, H] T is the longest sequence in the batch; each row is one behavior item.
      keys_length: [B]       true (unpadded) length of each sample's sequence.
    """
    # H: hidden size of each query vector
    queries_hidden_units = queries.get_shape().as_list()[-1]
    # tf.tile repeats a tensor: keep the batch axis, repeat the query T times
    # along axis 1, so queries becomes (B, H*T)
    queries = tf.tile(queries, [1, tf.shape(keys)[1]])
    # Reshape to (B, T, H): within each sample, every one of the T rows is the same query
    queries = tf.reshape(queries, [-1, tf.shape(keys)[1], queries_hidden_units])

    # All four tensors below are (B, T, H); concatenating on the last axis gives
    # (B, T, H*4). This pairs every behavior item with the target (candidate) item.
    din_all = tf.concat([queries, keys, queries - keys, queries * keys], axis=-1)
    # (B, T, 80)
    d_layer_1_all = tf.layers.dense(din_all, 80, activation=tf.nn.sigmoid, name='f1_att', reuse=tf.AUTO_REUSE)
    # (B, T, 40)
    d_layer_2_all = tf.layers.dense(d_layer_1_all, 40, activation=tf.nn.sigmoid, name='f2_att', reuse=tf.AUTO_REUSE)
    # (B, T, 1)
    d_layer_3_all = tf.layers.dense(d_layer_2_all, 1, activation=None, name='f3_att', reuse=tf.AUTO_REUSE)
    # (B, 1, T): one attention logit per behavior item, each logit already mixing
    # that item with the target item
    d_layer_3_all = tf.reshape(d_layer_3_all, [-1, 1, tf.shape(keys)[1]])
    outputs = d_layer_3_all

    # Mask: each row has T entries; if keys_length[i] is 5, the first 5 entries
    # of row i in key_masks are True
    key_masks = tf.sequence_mask(keys_length, tf.shape(keys)[1])  # [B, T]
    key_masks = tf.expand_dims(key_masks, 1)                      # [B, 1, T]
    # Fill padded positions with a large negative number so softmax ignores them
    paddings = tf.ones_like(outputs) * (-2 ** 32 + 1)
    outputs = tf.where(key_masks, outputs, paddings)              # [B, 1, T]

    # Scale by the square root of the embedding dimension
    outputs = outputs / (keys.get_shape().as_list()[-1] ** 0.5)
    # Activation
    outputs = tf.nn.softmax(outputs)    # [B, 1, T]
    # Weighted sum
    outputs = tf.matmul(outputs, keys)  # [B, 1, H]

    return outputs

Attention in Seq2Seq:

import tensorflow as tf
def attention(H):
    # H:[batch_size, time_step, hidden_size]
    H_shape = H.shape.as_list()
    time_step, hidden_size = H_shape[1], H_shape[2]
    h_t = tf.Variable(tf.truncated_normal(shape=[hidden_size, 1], stddev=0.5, dtype=tf.float32))
    # W:[hidden_size, hidden_size]
    W = tf.Variable(tf.truncated_normal(shape=[hidden_size, hidden_size],
                                        stddev=0.5, dtype=tf.float32))
    # score: [batch_size*time_step, 1]
    score = tf.matmul(tf.matmul(tf.reshape(H, [-1,hidden_size]), W), h_t)  
    # score: [batch_size, time_step, 1]
    score = tf.reshape(score,[-1, time_step, 1])
    # alpha:[batch_size, time_step, 1]
    alpha = tf.nn.softmax(score)
    c_t = tf.matmul(tf.transpose(H, [0, 2, 1]), alpha)
    return tf.tanh(c_t)

6.2 Multi-Head Attention

[Figure: scaled dot-product attention]

Scaled dot-product attention is the familiar attention variant that measures similarity with a dot product.

[Figure: multi-head attention]

The structure of multi-head attention is shown above. Q, K and V first go through a linear transformation and are then fed into scaled dot-product attention. This is repeated h times (each pass is one "head"), with different projection parameters W_i^Q, W_i^K, W_i^V each time. The h scaled dot-product outputs are concatenated and passed through one final linear transformation to produce the multi-head attention result.
Multi-head attention lets the model learn relevant information in different representation subspaces. In formulas:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

Reference code:

def scaled_dot_product_attention(query, key, value, mask):
    matmul_qk = tf.matmul(query, key, transpose_b=True)

    # Scale by sqrt(d_k)
    depth = tf.cast(tf.shape(key)[-1], tf.float32)
    logits = matmul_qk / tf.math.sqrt(depth)

    # Add the mask to zero out padding tokens
    if mask is not None:
        logits += (mask * -1e9)

    attention_weights = tf.nn.softmax(logits, axis=-1)

    return tf.matmul(attention_weights, value)

The multi-head part:

# Multi-head attention network
def multihead_attention(queries, keys, values,
                        num_heads=8,
                        dropout_rate=0,
                        training=True,
                        causality=False,
                        scope="multihead_attention"):
    '''
        It is beneficial to linearly project the queries, keys and values h times,
        to d_k, d_k and d_v dimensions respectively. Attention is applied to each
        projected version in parallel, yielding d_v-dimensional outputs; these are
        concatenated and projected once more to produce the final values.
        :param queries: 3-D tensor [N, T_q, d_model]
        :param keys   : 3-D tensor [N, T_k, d_model]
        :param values : 3-D tensor [N, T_k, d_model]
        :param num_heads: number of heads
        :param dropout_rate:
        :param training : controls the dropout behavior
        :param causality: whether to apply a causal (look-ahead) mask
        :param scope:
        :return: 3-D tensor (N, T_q, C)
    '''
    d_model = queries.get_shape().as_list()[-1]
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        # Linear projections
        Q = tf.layers.dense(queries, d_model, use_bias=False) # (N, T_q, d_model)
        K = tf.layers.dense(keys, d_model, use_bias=False)    # (N, T_k, d_model)
        V = tf.layers.dense(values, d_model, use_bias=False)  # (N, T_k, d_model)

        # Split and concat
        Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0) # (h*N, T_q, d_model/h)
        K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0) # (h*N, T_k, d_model/h)
        V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0) # (h*N, T_k, d_model/h)

        # Attention; the helper above expects (query, key, value, mask), so build
        # a causal/padding mask here when causality is True (dropout omitted)
        outputs = scaled_dot_product_attention(Q_, K_, V_, mask=None)

        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2) # (N, T_q, d_model)

        # Residual connection
        outputs += queries

        # Normalize (ln is a layer-normalization helper defined elsewhere)
        outputs = ln(outputs)

    return outputs

6.3 Self-Attention

What makes the self-attention mechanism special is that Q = K = V, i.e. the queries, keys and values are all derived from the same input sequence; hence the name self-attention.


7. Feature Cross Networks

Explicit feature crossing complements the implicit feature crossing of deep models and is widely used in practice. Common feature-cross structures include FM, the Cross network, and CIN.

7.1 FM Cross Network

The FM cross network implements second-order interactions between features; DeepFM is the typical example.

[Figure: DeepFM architecture]

The feature-interaction formula is:

y_FM = Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j = ½ Σ_{k=1}^{K} [ (Σ_i v_{ik} x_i)² − Σ_i v_{ik}² x_i² ]

Example code:

def second_order_part(self, sparse_id, sparse_value):
    with tf.variable_scope("second-order"):
        V = tf.get_variable("weight", (self.feature_size, self.factor_size),
                initializer=tf.random_normal_initializer(0.0, 0.01))
        self.embeddings = tf.nn.embedding_lookup(V, sparse_id)

        # None * F * K (sparse_value is assumed shaped to broadcast as [None, F, 1])
        self.embeddings = tf.multiply(self.embeddings, sparse_value)

        # Square of sum: None * K
        sum_squared_part = tf.square(tf.reduce_sum(self.embeddings, 1))
        # Sum of squares: None * K
        squared_sum_part = tf.reduce_sum(tf.square(self.embeddings), 1)

        y_second_order = 0.5 * tf.subtract(sum_squared_part,
                                    squared_sum_part)
        return y_second_order
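The square-of-sum minus sum-of-squares trick in `second_order_part` relies on the identity 2·Σ_{i<j}⟨e_i, e_j⟩ = (Σ_i e_i)² − Σ_i e_i², which reduces the pairwise cost from O(F²K) to O(FK). A quick numpy check (illustrative; `E` stands in for the value-scaled embeddings of one sample):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
F, K = 5, 3
E = rng.normal(size=(F, K))  # one K-dim embedding per feature, already scaled by its value

# O(F*K): 0.5 * (square of sum - sum of squares), summed over latent dims
fast = 0.5 * ((E.sum(axis=0) ** 2) - (E ** 2).sum(axis=0)).sum()

# O(F^2 * K): naive sum over all feature pairs
slow = sum(E[i] @ E[j] for i, j in itertools.combinations(range(F), 2))
```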

7.2 Cross Network

[Figure: Cross network]

The Cross network can realize feature interactions of arbitrary order (in practice the layer depth is chosen per application); it is typically used in the DCN model. Reference code:

def cross_layer(x0, x, name):
  with tf.variable_scope(name):
    input_dim = x0.get_shape().as_list()[1]
    w = tf.get_variable("weight", [input_dim],
                        initializer=tf.truncated_normal_initializer(stddev=0.01))
    b = tf.get_variable("bias", [input_dim],
                        initializer=tf.truncated_normal_initializer(stddev=0.01))
    # xb: [B, 1] -- the per-sample scalar x^T w
    xb = tf.tensordot(tf.reshape(x, [-1, 1, input_dim]), w, 1)
    return x0 * xb + b + x

There is a trick here: changing the order of operations (associativity of matrix multiplication) greatly improves performance, as shown in the figure below:

[Figure: reordered cross-layer computation]
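A minimal numpy sketch of the reordering for one sample (function names are mine, for illustration): because x₀ (xᵀ w) = (x · w) x₀, the d×d outer product never needs to be materialized.

```python
import numpy as np

def cross_naive(x0, x, w, b):
    # Materializes the (d, d) outer product x0 x^T, then multiplies by w: O(d^2)
    return np.outer(x0, x) @ w + b + x

def cross_fast(x0, x, w, b):
    # Associativity: x0 (x^T w) = (x . w) * x0, a scalar-vector product: O(d)
    return (x @ w) * x0 + b + x

rng = np.random.default_rng(1)
x0, x, w, b = (rng.normal(size=8) for _ in range(4))
```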

Cross's special structure makes the degree of the cross features grow with layer depth: relative to the input x₀, an l-layer cross network produces cross features of degree up to l + 1. The FM introduced above is a very shallow structure restricted to second-order feature combinations; Cross extends the same parameter-sharing idea from one layer to many layers and can learn high-order feature combinations. Unlike high-order variants of FM, the number of Cross parameters grows only linearly with the input dimension.


References:

杨旭东: One step away from mastering the production-grade Deep & Cross Network model (zhuanlan.zhihu.com)

7.3 CIN Cross Network


CIN (Compressed Interaction Network) improves on the Cross network: it performs explicit high-order feature interactions at the vector-wise level, further strengthening the model's interaction capacity. The computation proceeds as follows:

  • Outer product

[Figure: outer-product step]

First introduce the intermediate tensor Z^{k+1}, the outer product of the hidden layer X^k and the input X^0 (taken along each embedding dimension). Here X^0 ∈ R^{m×D} is the raw embedding input to the CIN network, and X^k ∈ R^{H_k×D} is the k-th hidden layer, obtained by interacting X^{k-1} with X^0.

  • Feature compression

Treat the Z^{k+1} computed above as an image and convolve it along the embedding dimension D with H_{k+1} filters of size H_k × m, obtaining X^{k+1} ∈ R^{H_{k+1}×D}, as shown below:

[Figure: feature compression via convolution]

  • Feature concatenation

Apply sum pooling to each X^k and concatenate the pooled vectors into a single one-dimensional feature vector, as shown below:

[Figure: sum pooling and concatenation]

Reference code:

# Embedding dimension
D = Config.embedding_size
final_result = []
final_len = 0

# Raw DNN input features
nn_input = tf.reshape(dnn_input,
              shape=[-1, self.field_size, Config.embedding_size])
# Cache each CIN layer
cin_layers = [nn_input]
field_nums = [self.field_size]
# Split the raw input into D slices along the last dimension
split_tensor_0 = tf.split(nn_input, D * [1], 2)

# Build the CIN layers one by one
for idx, layer_size in enumerate(Config.cross_layer_size):
      # Split the latest layer into D slices along the last dimension
      now_tensor = tf.split(cin_layers[-1], D * [1], 2)
      # Outer product: H_k x m
      dot_result_m = tf.matmul(split_tensor_0, now_tensor, transpose_b=True)
      # Build Z with shape [batch, D, H_k * m]
      dot_result_o = tf.reshape(dot_result_m, shape=[D, -1, field_nums[0] * field_nums[-1]])
      dot_result = tf.transpose(dot_result_o, perm=[1, 0, 2])
      # Feature compression (build filters -> 1-D convolution -> activation -> transpose)
      filters = tf.get_variable(name="f_" + str(idx), shape=[1, field_nums[-1] * field_nums[0], layer_size],
                                      dtype=tf.float32)
      curr_out = tf.nn.conv1d(dot_result, filters=filters, stride=1, padding='VALID')
      b = tf.get_variable(name="f_b" + str(idx), shape=[layer_size], dtype=tf.float32,
                                initializer=tf.zeros_initializer())
      curr_out = tf.nn.relu(tf.nn.bias_add(curr_out, b))
      curr_out = tf.transpose(curr_out, perm=[0, 2, 1])
      if Config.cross_direct:
          direct_connect = curr_out
          next_hidden = curr_out
          final_len += layer_size
          field_nums.append(int(layer_size))
      else:
          if idx != len(Config.cross_layer_size) - 1:
               next_hidden, direct_connect = tf.split(curr_out, 2 * [int(layer_size / 2)], 1)
               final_len += int(layer_size / 2)
               # The next layer sees only the half that is fed forward
               field_nums.append(int(layer_size / 2))
          else:
               direct_connect = curr_out
               next_hidden = 0
               final_len += layer_size
      # Save the newest layer
      final_result.append(direct_connect)
      cin_layers.append(next_hidden)

# Feature concatenation (sum pooling)
result = tf.concat(final_result, axis=1)
result = tf.reduce_sum(result, -1)

# Output projection
w_nn_output1 = tf.get_variable(name='w_nn_output1', shape=[final_len, Config.cross_output_size],
                                       dtype=tf.float32)
b_nn_output1 = tf.get_variable(name='b_nn_output1', shape=[Config.cross_output_size], dtype=tf.float32,
                                       initializer=tf.zeros_initializer())
CIN_out = tf.nn.xw_plus_b(result, w_nn_output1, b_nn_output1)

8. Regularization

Regularization penalizes feature weights: the weights themselves become part of the model's loss. Intuitively, using a feature costs some loss, so unless the feature is genuinely effective, its benefit is outweighed by the added loss. This filters for the most useful features and shrinks weights to prevent overfitting. In general, L1 regularization produces sparse features, driving most useless feature weights to exactly zero, while L2 regularization keeps weights from growing too large, spreading them more evenly.

8.1 L2 Regularization

L(w) = L₀(w) + (λ/2) · ‖w‖₂²

First, when declaring a weight variable, add its regularization loss to a dedicated collection:

def get_weights(shape, weight_decay=0.0, dtype=tf.float32, trainable=True):
    """
    Create a weight variable and add its regularization loss to a collection.
    Args:
        shape: shape of the weight tensor
        weight_decay: L2 penalty coefficient
        dtype:
        trainable:
    Returns:
        the weight variable
    """
    weight = tf.Variable(initial_value=tf.truncated_normal(shape=shape, stddev=0.01),
                             name='Weights', dtype=dtype, trainable=trainable)
    if weight_decay > 0:
        # Step 1: compute the regularization loss
        weight_loss = tf.nn.l2_loss(weight) * weight_decay
        # Step 2: add it to a collection (built-in or custom)
        tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, value=weight_loss)
    return weight

Then, when computing the total loss, add the accumulated regularization losses to the model's own loss:

with tf.variable_scope("loss"):
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits,
                                                            labels=input_label_placeholder,
                                                            name='entropy')
    loss_op = tf.reduce_mean(input_tensor=cross_entropy, name='loss')
    weight_loss_op = tf.losses.get_regularization_losses()
    weight_loss_op = tf.add_n(weight_loss_op)
    total_loss_op = loss_op + weight_loss_op

Finally, hand the total loss to the optimizer:

# Optimizer: build the training op before creating the session so that the
# optimizer's slot variables are covered by init_op
train_op = tf.train.GradientDescentOptimizer(learning_rate=LEARNING_RATE).minimize(loss=total_loss_op,
                                                                                   global_step=global_step)

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
    sess.run(init_op)
    input_data, input_label = sess.run([data_batch, label_batch])

    feed_dict = {input_data_placeholder: input_data, input_label_placeholder: input_label}

    _, total_loss, loss, weight_loss = sess.run([train_op, total_loss_op, loss_op, weight_loss_op],
                                                feed_dict=feed_dict)

References:

我继续: Adding regularization terms to TensorFlow loss functions

8.2 L1 Regularization

L(w) = L₀(w) + λ · ‖w‖₁

The code is the same as above.

[Figure: L2 rescaling vs. L1 soft thresholding]

As the figure shows, L2 regularization effectively rescales the weights, while L1 regularization performs soft thresholding, setting many weights exactly to zero, which yields sparse solutions.
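The two behaviors can be illustrated with their proximal (shrinkage) operators; a minimal numpy sketch (function names are mine, for illustration):

```python
import numpy as np

def l2_shrink(w, lam):
    # L2: every weight is rescaled by the same factor; nothing becomes exactly zero
    return w / (1.0 + lam)

def l1_soft_threshold(w, lam):
    # L1: shrink toward zero and clip, so small weights land exactly at zero
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([3.0, 0.4, -0.2, -5.0])
sparse_w = l1_soft_threshold(w, 0.5)   # -> [2.5, 0.0, 0.0, -4.5]
```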

8.3 Group Lasso

Yuan extended the lasso to groups of variables in 2006, giving rise to the group lasso. Partition the variables into groups and penalize the L2 norm of each group in the objective; this can zero out an entire group of coefficients at once, i.e. remove a whole group of variables. Group lasso is thus a generalization of the lasso to grouped features; when every group contains a single feature, it reduces to the ordinary lasso. The objective is:

min_β ‖y − Xβ‖₂² + λ Σ_l √(p_l) · ‖β_l‖₂
import math
import re
import tensorflow as tf

def group_lasso(alpha, scale, groups):
    all_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
    all_weights = [x for x in all_variables if re.search("weights", x.name)]
    weights1 = [x for x in all_weights if re.search("hidden1/weights", x.name)][0]
    weights_others = [x for x in all_weights if not re.search("hidden1/weights", x.name)]

    # Group-lasso regularization for the input-hidden1 weights
    regularizer = tf.contrib.layers.l2_regularizer(scale=scale)
    rg = 0.0
    for group_id in list(set(groups)):
        this_group_mask = [i for i, x in enumerate(groups) if x == group_id]
        pl = len(this_group_mask)
        rg += math.sqrt(pl) * tf.sqrt(regularizer(tf.gather(weights1, tf.to_int64(this_group_mask))))
    # L1 regularization for the remaining weights
    regularizer2 = tf.contrib.layers.l1_regularizer(scale=scale)
    if alpha != 1:
        rg = rg * alpha + (1 - alpha) * tf.contrib.layers.apply_regularization(regularizer2, weights_others)
    return rg
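The penalty term itself is easy to state in plain numpy (an illustrative sketch; `groups` here is assumed to be a list of index lists, one per group):

```python
import numpy as np

def group_lasso_penalty(w, groups, lam):
    """Sum over groups of sqrt(p_l) * ||w_g||_2, scaled by lam."""
    return lam * sum(np.sqrt(len(idx)) * np.linalg.norm(w[idx]) for idx in groups)

w = np.array([1.0, -2.0, 0.0, 3.0])
penalty = group_lasso_penalty(w, groups=[[0, 1], [2, 3]], lam=0.1)
```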

Sparse group lasso adds an L1 term on top of the group penalty:

min_β ‖y − Xβ‖₂² + λ [ α Σ_l √(p_l) · ‖β_l‖₂ + (1 − α) · ‖β‖₁ ]
def sparse_group_lasso(alpha, scale, groups):
    all_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
    all_weights = [x for x in all_variables if re.search("weights", x.name)]
    weights1 = [x for x in all_weights if re.search("hidden1/weights", x.name)][0]

    # Group-lasso regularization for the input-hidden1 weights
    regularizer = tf.contrib.layers.l2_regularizer(scale=scale)
    rg = 0.0
    for group_id in list(set(groups)):
        this_group_mask = [i for i, x in enumerate(groups) if x == group_id]
        pl = len(this_group_mask)
        rg += math.sqrt(pl) * tf.sqrt(regularizer(tf.gather(weights1, tf.to_int64(this_group_mask))))
    # Plus an L1 term over all weights (the "sparse" part)
    regularizer2 = tf.contrib.layers.l1_regularizer(scale=scale)
    rg = alpha * rg + (1 - alpha) * tf.contrib.layers.apply_regularization(regularizer2, all_weights)
    return rg
https://github.com/WGLab/GDP​github.com

9. Multi-Task Learning

The defining trait of multi-task learning is a single input x with multiple different outputs y¹, ..., yᴷ. You therefore need one placeholder per output, along with a separate loss function to compute each task's loss.

9.1 Simple Multi-Task Learning

[Figure: multi-task learning diagram]
import tensorflow as tf
import numpy as np

# Placeholders
X = tf.placeholder("float", [10, 10], name="X")
Y1 = tf.placeholder("float", [10, 20], name="Y1")
Y2 = tf.placeholder("float", [10, 20], name="Y2")

# Weights
initial_shared_layer_weights = np.random.rand(10, 20)
initial_Y1_layer_weights = np.random.rand(20, 20)
initial_Y2_layer_weights = np.random.rand(20, 20)

shared_layer_weights = tf.Variable(initial_shared_layer_weights, name="share_W", dtype="float32")
Y1_layer_weights = tf.Variable(initial_Y1_layer_weights, name="share_Y1", dtype="float32")
Y2_layer_weights = tf.Variable(initial_Y2_layer_weights, name="share_Y2", dtype="float32")

# Layers with ReLU activations
shared_layer = tf.nn.relu(tf.matmul(X, shared_layer_weights))
Y1_layer = tf.nn.relu(tf.matmul(shared_layer, Y1_layer_weights))
Y2_layer = tf.nn.relu(tf.matmul(shared_layer, Y2_layer_weights))

# Per-task losses
Y1_Loss = tf.nn.l2_loss(Y1 - Y1_layer)
Y2_Loss = tf.nn.l2_loss(Y2 - Y2_layer)

Alternating training:

# One optimizer per task
Y1_op = tf.train.AdamOptimizer().minimize(Y1_Loss)
Y2_op = tf.train.AdamOptimizer().minimize(Y2_Loss)

with tf.Session() as session:
    session.run(tf.initialize_all_variables())
    for iters in range(10):
        if np.random.rand() < 0.5:
            _, Y1_loss = session.run([Y1_op, Y1_Loss],
                            {
                              X: np.random.rand(10,10)*10,
                              Y1: np.random.rand(10,20)*10,
                              Y2: np.random.rand(10,20)*10
                              })
            print(Y1_loss)
        else:
            _, Y2_loss = session.run([Y2_op, Y2_Loss],
                            {
                              X: np.random.rand(10,10)*10,
                              Y1: np.random.rand(10,20)*10,
                              Y2: np.random.rand(10,20)*10
                              })
            print(Y2_loss)

Joint training:

# Per-task losses
Y1_Loss = tf.nn.l2_loss(Y1 - Y1_layer)
Y2_Loss = tf.nn.l2_loss(Y2 - Y2_layer)

# Sum the two losses (the key step)
Joint_Loss = Y1_Loss + Y2_Loss

# Optimizer
Optimiser = tf.train.AdamOptimizer().minimize(Joint_Loss)

# Joint training
with tf.Session() as session:
    session.run(tf.initialize_all_variables())
    _, joint_loss = session.run([Optimiser, Joint_Loss],
                    {
                      X: np.random.rand(10,10)*10,
                      Y1: np.random.rand(10,20)*10,
                      Y2: np.random.rand(10,20)*10
                      })
    print(joint_loss)

Reference code:

https://github.com/jg8610/multi-task-part-1-notebook/blob/master/Multi-Task%20Learning%20Tensorflow%20Part%201.ipynb​github.com https://github.com/jg8610/multi-task-learning/blob/master/graph.py​github.com

9.2 Expert Networks

MMoE: Multi-gate Mixture-of-Experts. Different tasks (CTR, CVR, and so on) call for different weightings of the shared representation, so the authors give each task its own gate network. For each of the K tasks, the task-specific gate's softmax output gives the probability of selecting each expert, and the task's input is the corresponding weighted sum of the expert outputs.

[Figure: MMoE architecture]

Shared bottom:

def model_fn(features, labels, mode, params):
    tf.set_random_seed(2019)

    cont_feats = features["cont_feats"]
    cate_feats = features["cate_feats"]
    vector_feats = features["vector_feats"]

    single_cate_feats = cate_feats[:, 0:params.cate_field_size]
    multi_cate_feats = cate_feats[:, params.cate_field_size:]
    cont_feats_index = tf.Variable([[i for i in range(params.cont_field_size)]], trainable=False, dtype=tf.int64,
                                   name="cont_feats_index")

    cont_index_add = tf.add(cont_feats_index, params.cate_feats_size)

    index_max_size = params.cont_field_size + params.cate_feats_size
    feats_emb = my_layer.emb_init(name='feats_emb', feat_num=index_max_size, embedding_size=params.embedding_size)

    # cont_feats -> Embedding
    with tf.name_scope("cont_feat_emb"):
        ori_cont_emb = tf.nn.embedding_lookup(feats_emb, ids=cont_index_add, name="ori_cont_emb")
        cont_value = tf.reshape(cont_feats, shape=[-1, params.cont_field_size, 1], name="cont_value")
        cont_emb = tf.multiply(ori_cont_emb, cont_value)
        cont_emb = tf.reshape(cont_emb, shape=[-1, params.cont_field_size * params.embedding_size], name="cont_emb")

    # single_category -> Embedding
    with tf.name_scope("single_cate_emb"):
        cate_emb = tf.nn.embedding_lookup(feats_emb, ids=single_cate_feats)
        cate_emb = tf.reshape(cate_emb, shape=[-1, params.cate_field_size * params.embedding_size])

    # multi_category -> Embedding
    with tf.name_scope("multi_cate_emb"):
        multi_cate_emb = my_layer.multi_cate_emb(params.multi_feats_range, feats_emb, multi_cate_feats)

    # deep input dense
    dense_input = tf.concat([cont_emb, vector_feats, cate_emb, multi_cate_emb], axis=1, name='dense_vector')

The expert part:

def model_fn(features, labels, mode, params):
    ...  # shared-bottom part, as above

    # deep input dense
    dense_input = tf.concat([cont_emb, vector_feats, cate_emb, multi_cate_emb],
                                axis=1, name='dense_vector')

    # experts
    experts_weight = tf.get_variable(name='experts_weight',
                       dtype=tf.float32,
                       shape=(dense_input.get_shape()[1], params.experts_units, params.experts_num),
                       initializer=tf.contrib.layers.xavier_initializer())
    experts_bias = tf.get_variable(name='expert_bias',
                       dtype=tf.float32,
                       shape=(params.experts_units, params.experts_num),
                       initializer=tf.contrib.layers.xavier_initializer())

    # f_{i}(x) = activation(W_{i} * x + b)
    experts_output = tf.tensordot(dense_input, experts_weight, axes=1)
    use_experts_bias = True
    if use_experts_bias:
        experts_output = tf.add(experts_output, experts_bias)
    experts_output = tf.nn.relu(experts_output)

The gate networks:

def model_fn(features, labels, mode, params):
    
    ...

    # gates
    gate1_weight = tf.get_variable(name='gate1_weight',
                                   dtype=tf.float32,
                                   shape=(dense_input.get_shape()[1], params.experts_num),
                                   initializer=tf.contrib.layers.xavier_initializer())
    gate1_bias = tf.get_variable(name='gate1_bias',
                                 dtype=tf.float32,
                                 shape=(params.experts_num,),
                                 initializer=tf.contrib.layers.xavier_initializer())
    gate2_weight = tf.get_variable(name='gate2_weight',
                                   dtype=tf.float32,
                                   shape=(dense_input.get_shape()[1], params.experts_num),
                                   initializer=tf.contrib.layers.xavier_initializer())
    gate2_bias = tf.get_variable(name='gate2_bias',
                                 dtype=tf.float32,
                                 shape=(params.experts_num,),
                                 initializer=tf.contrib.layers.xavier_initializer())

    # g^{k}(x) = activation(W_{gk} * x + b), where activation is softmax according to the paper
    gate1_output = tf.matmul(dense_input, gate1_weight)
    gate2_output = tf.matmul(dense_input, gate2_weight)
    use_gate_bias = True
    if use_gate_bias:
        gate1_output = tf.add(gate1_output, gate1_bias)
        gate2_output = tf.add(gate2_output, gate2_bias)
    gate1_output = tf.nn.softmax(gate1_output)
    gate2_output = tf.nn.softmax(gate2_output)
    ...

Multi-task fusion:

def model_fn(features, labels, mode, params):
    
    ...

    # f^{k}(x) = sum_{i=1}^{n}(g^{k}(x)_{i} * f_{i}(x))
    label1_input = tf.multiply(experts_output, tf.expand_dims(gate1_output, axis=1))
    label1_input = tf.reduce_sum(label1_input, axis=2)
    label1_input = tf.reshape(label1_input, [-1, params.experts_units])
    label2_input = tf.multiply(experts_output, tf.expand_dims(gate2_output, axis=1))
    label2_input = tf.reduce_sum(label2_input, axis=2)
    label2_input = tf.reshape(label2_input, [-1, params.experts_units])

    len_layers = len(params.hidden_units)
    with tf.variable_scope('ctr_deep'):
        dense_ctr = tf.layers.dense(inputs=label1_input, units=params.hidden_units[0], activation=tf.nn.relu)
        for i in range(1, len_layers):
            dense_ctr = tf.layers.dense(inputs=dense_ctr, units=params.hidden_units[i], activation=tf.nn.relu)
        ctr_out = tf.layers.dense(inputs=dense_ctr, units=1)
    with tf.variable_scope('cvr_deep'):
        dense_cvr = tf.layers.dense(inputs=label2_input, units=params.hidden_units[0], activation=tf.nn.relu)
        for i in range(1, len_layers):
            dense_cvr = tf.layers.dense(inputs=dense_cvr, units=params.hidden_units[i], activation=tf.nn.relu)
        cvr_out = tf.layers.dense(inputs=dense_cvr, units=1)

    ctr_score = tf.identity(tf.nn.sigmoid(ctr_out), name='ctr_score')
    cvr_score = tf.identity(tf.nn.sigmoid(cvr_out), name='cvr_score')
    ctcvr_score = ctr_score * cvr_score
    ctcvr_score = tf.identity(ctcvr_score, name='ctcvr_score')

    score = tf.add(ctr_score * params.label1_weight, cvr_score * params.label2_weight)
    score = tf.identity(score, name='score')

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=score)

    else:
        ctr_labels = tf.identity(labels['label'], name='ctr_labels')
        ctcvr_labels = tf.identity(labels['label2'], name='ctcvr_labels')
        ctr_auc = tf.metrics.auc(labels=ctr_labels, predictions=ctr_score, name='auc')
        ctcvr_auc = tf.metrics.auc(labels=ctcvr_labels, predictions=ctcvr_score, name='auc')
        metrics = {
            'ctr_auc': ctr_auc,
            'ctcvr_auc': ctcvr_auc
        }
        # ctr_loss = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=ctr_labels, logits=ctr_out))
        ctr_loss = tf.reduce_mean(tf.losses.log_loss(labels=ctr_labels, predictions=ctr_score))
        ctcvr_loss = tf.reduce_mean(tf.losses.log_loss(labels=ctcvr_labels, predictions=ctcvr_score))
        loss = ctr_loss + ctcvr_loss

        if mode == tf.estimator.ModeKeys.TRAIN:
            optimizer = tf.train.AdamOptimizer(params.learning_rate)
            train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
        else:
            train_op = None

https://github.com/R-Stalker/deep_learning_estimator_labels/blob/master/models/mmoe.py (github.com) · Multi-task Learning in LM (MT-DNN, ERNIE 2.0) (blog.csdn.net)

10. Handling Missing Features


In TensorFlow, tf.cond() plays the role of C's if...else..., controlling which branch of the dataflow graph is executed.

tf.cond(
    pred,
    true_fn=None,
    false_fn=None,
    strict=False,
    name=None,
    fn1=None,
    fn2=None
)

a = tf.constant(1)
b = tf.constant(2)
p = tf.constant(True)
x = tf.cond(p, lambda: a + b, lambda: a * b)
print(tf.Session().run(x))
# Output: 3

As shown above, only three of the arguments are commonly used, so the call can be shortened to tf.cond(pred, fn1, fn2), analogous to Java's ternary operator "? :".
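tf.cond selects between whole subgraphs; for per-element missing-value handling, the elementwise analogue tf.where (already used in the attention code above) is the usual tool. The same pattern in numpy, assuming missing entries are NaN-coded (the encoding and the per-column defaults are assumptions for illustration):

```python
import numpy as np

def impute_missing(feats, defaults):
    """Replace NaN-coded missing entries with per-column default values."""
    return np.where(np.isnan(feats), defaults, feats)

x = np.array([[1.0, np.nan],
              [np.nan, 4.0]])
filled = impute_missing(x, defaults=np.array([0.0, 2.0]))  # -> [[1., 2.], [0., 4.]]
```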

11. Common Low-Level Operations

11.1 Extracting Features from a Specific Layer

Deep models are powerful feature extractors. Sometimes we train a classification model not to classify, but to extract features for other tasks, such as similar-image computation.

# 1) Give the target layer a name when building the model
...
h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total], name='h_pool_flat')
...

# 2) Fetch the h_pool_flat features via sess.run()
feature = graph.get_operation_by_name("h_pool_flat").outputs[0]
batch_predictions, batch_feature = \
    sess.run([predictions, feature], {input_x: x_test_batch, dropout_keep_prob: 1.0})

11.2 Different Optimizers for Different Model Components

def _train_op_fn(loss):
    """Returns the op to optimize the loss."""
    train_ops = []
    global_step = tf.train.get_global_step()
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        if dnn_logits is not None:
            train_ops.append(
                dnn_optimizer.minimize(
                    loss,
                    global_step=global_step,
                    var_list=tf.get_collection(
                        tf.GraphKeys.TRAINABLE_VARIABLES,
                        scope=dnn_parent_scope)))
        if linear_logits is not None:
            train_ops.append(
                linear_optimizer.minimize(
                    loss,
                    global_step=global_step,
                    var_list=tf.get_collection(
                        tf.GraphKeys.TRAINABLE_VARIABLES,
                        scope=linear_parent_scope)))
        if cnn_logits is not None:
            train_ops.append(
                cnn_optimizer.minimize(
                    loss,
                    global_step=global_step,
                    var_list=tf.get_collection(
                        tf.GraphKeys.TRAINABLE_VARIABLES,
                        scope=cnn_parent_scope)))
        # Combine the per-component train ops
        train_op = tf.group(*train_ops)
    with tf.control_dependencies([train_op]):
        # Increment the global step
        with tf.colocate_with(global_step):
            return tf.assign_add(global_step, 1)

tf.group() combines multiple operations: ops = tf.group(tensor1, tensor2, ...) accepts zero or more tensors/ops, and running the grouped op runs all of them. It is commonly used to bundle several training ops into one.
