This article mainly introduces how to implement MMoE in TensorFlow.
1. The MMoE concept
First, a quick refresher on the MMoE concept: MMoE (Multi-gate Mixture-of-Experts) shares a group of expert networks across all tasks, and gives each task its own gating network that produces a softmax-weighted combination of the experts' outputs. For more detail, see:
https://blog.csdn.net/u013250416/article/details/118642297
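For reference, the core formulation from the MMoE paper (Ma et al., KDD 2018), where f_i are the n shared experts, g^k is task k's gate, and h^k is the task-specific tower:

y_k = h^k\bigl(f^k(x)\bigr), \quad
f^k(x) = \sum_{i=1}^{n} g^k(x)_i \, f_i(x), \quad
g^k(x) = \mathrm{softmax}(W_{gk}\, x)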
2. Analysis of the existing Keras implementation
There is already a Keras implementation of MMoE on GitHub (https://github.com/drawbridge/keras-mmoe/blob/master/mmoe.py). Using this existing code, we can work through the dimensions of each component in MMoE. Below are its build and call methods (excerpted; the constructor simply stores units, num_experts, num_tasks and the various initializers/regularizers referenced here):
class MMoE(Layer):
    """
    Multi-gate Mixture-of-Experts model.
    """

    def build(self, input_shape):
        """
        Method for creating the layer weights.

        :param input_shape: Keras tensor (future input to layer)
            or list/tuple of Keras tensors to reference
            for weight shape computations
        """
        assert input_shape is not None and len(input_shape) >= 2

        input_dimension = input_shape[-1]

        # Initialize expert weights (number of input features * number of units per expert * number of experts)
        self.expert_kernels = self.add_weight(
            name='expert_kernel',
            shape=(input_dimension, self.units, self.num_experts),
            initializer=self.expert_kernel_initializer,
            regularizer=self.expert_kernel_regularizer,
            constraint=self.expert_kernel_constraint,
        )

        # Initialize expert bias (number of units per expert * number of experts)
        if self.use_expert_bias:
            self.expert_bias = self.add_weight(
                name='expert_bias',
                shape=(self.units, self.num_experts),
                initializer=self.expert_bias_initializer,
                regularizer=self.expert_bias_regularizer,
                constraint=self.expert_bias_constraint,
            )

        # Initialize gate weights (number of input features * number of experts * number of tasks)
        self.gate_kernels = [self.add_weight(
            name='gate_kernel_task_{}'.format(i),
            shape=(input_dimension, self.num_experts),
            initializer=self.gate_kernel_initializer,
            regularizer=self.gate_kernel_regularizer,
            constraint=self.gate_kernel_constraint
        ) for i in range(self.num_tasks)]

        # Initialize gate bias (number of experts * number of tasks)
        if self.use_gate_bias:
            self.gate_bias = [self.add_weight(
                name='gate_bias_task_{}'.format(i),
                shape=(self.num_experts,),
                initializer=self.gate_bias_initializer,
                regularizer=self.gate_bias_regularizer,
                constraint=self.gate_bias_constraint
            ) for i in range(self.num_tasks)]

        self.input_spec = InputSpec(min_ndim=2, axes={-1: input_dimension})

        super(MMoE, self).build(input_shape)

    def call(self, inputs, **kwargs):
        """
        Method for the forward function of the layer.

        :param inputs: Input tensor
        :param kwargs: Additional keyword arguments for the base method
        :return: A tensor
        """
        gate_outputs = []
        final_outputs = []

        # f_{i}(x) = activation(W_{i} * x + b), where activation is ReLU according to the paper
        expert_outputs = tf.tensordot(a=inputs, b=self.expert_kernels, axes=1)
        # Add the bias term to the expert weights if necessary
        if self.use_expert_bias:
            expert_outputs = K.bias_add(x=expert_outputs, bias=self.expert_bias)
        expert_outputs = self.expert_activation(expert_outputs)

        # g^{k}(x) = activation(W_{gk} * x + b), where activation is softmax according to the paper
        for index, gate_kernel in enumerate(self.gate_kernels):
            gate_output = K.dot(x=inputs, y=gate_kernel)
            # Add the bias term to the gate weights if necessary
            if self.use_gate_bias:
                gate_output = K.bias_add(x=gate_output, bias=self.gate_bias[index])
            gate_output = self.gate_activation(gate_output)
            gate_outputs.append(gate_output)

        # f^{k}(x) = sum_{i=1}^{n}(g^{k}(x)_{i} * f_{i}(x))
        for gate_output in gate_outputs:
            expanded_gate_output = K.expand_dims(gate_output, axis=1)
            weighted_expert_output = expert_outputs * K.repeat_elements(expanded_gate_output, self.units, axis=1)
            final_outputs.append(K.sum(weighted_expert_output, axis=2))

        return final_outputs
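For context, here is a minimal usage sketch of this layer, modeled on the census-income demo in the same repo; the constructor signature MMoE(units, num_experts, num_tasks) comes from that repo, while the feature count and tower sizes below are made-up illustrative values:

from keras.layers import Input, Dense
from keras.models import Model

input_dimension = 100  # hypothetical feature count, for illustration only
input_layer = Input(shape=(input_dimension,))

# Returns a list of num_tasks tensors, each of shape [batch_size, units]
mmoe_layers = MMoE(units=4, num_experts=8, num_tasks=2)(input_layer)

output_layers = []
for index, task_layer in enumerate(mmoe_layers):
    # Task-specific tower on top of each per-task MMoE output
    tower_layer = Dense(units=8, activation='relu')(task_layer)
    output_layers.append(
        Dense(units=1, activation='sigmoid', name='task_{}'.format(index))(tower_layer))

model = Model(inputs=[input_layer], outputs=output_layers)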
1. Input shape: [batch_size, input_dimension]
2. Expert weights: [input_dimension, hidden_size, num_experts]
So the expert network output has shape: [batch_size, hidden_size, num_experts]
3. Gate weights: [input_dimension, num_experts, num_tasks]
So the gate network output has shape: [batch_size, num_experts, num_tasks]
In MMoE, each task has its own gating network: for each task, a gate decides how to combine the different expert networks.
So, for a single task, the corresponding gate output has shape: [batch_size, num_experts]
4. For each gate network:
Applying softmax to the gate output gives the weight of each expert network, with shape: [batch_size, num_experts]
Note that the expert outputs have shape [batch_size, hidden_size, num_experts],
while the expert weights have shape [batch_size, num_experts]; the weights therefore need to be expanded and repeated hidden_size times along dimension 1, giving [batch_size, hidden_size, num_experts].
Next, multiplying the expert outputs by these expert weights element-wise and summing over the expert dimension yields the combined expert output for the task: [batch_size, hidden_size]
So the overall output is num_tasks tensors, each of shape [batch_size, hidden_size].
5. Each of these [batch_size, hidden_size] tensors can then be processed further (e.g., by a task-specific tower) to produce the outputs we want. The NumPy sketch below walks through all of these shapes.
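To make the shape bookkeeping above concrete, here is a small NumPy sketch (the batch and layer sizes are illustrative; note that NumPy broadcasting plays the role of the explicit repeat/tile step in the implementations):

import numpy as np

batch_size, input_dimension = 32, 100            # illustrative sizes
hidden_size, num_experts, num_tasks = 16, 8, 2

x = np.random.randn(batch_size, input_dimension)
expert_w = np.random.randn(input_dimension, hidden_size, num_experts)
gate_w = np.random.randn(input_dimension, num_experts, num_tasks)

# Step 2: expert outputs -> (32, 16, 8)
expert_out = np.tensordot(x, expert_w, axes=1)

# Step 3: all gate outputs -> (32, 8, 2); one task's gate -> (32, 8)
gate_out = np.tensordot(x, gate_w, axes=1)
gate_k = gate_out[:, :, 0]

# Step 4: softmax over the expert dimension, then broadcast against expert_out
gate_k = np.exp(gate_k) / np.exp(gate_k).sum(axis=1, keepdims=True)
weighted = expert_out * gate_k[:, None, :]       # (32, 16, 8)
task_out = weighted.sum(axis=2)                  # (32, 16) == [batch_size, hidden_size]

print(expert_out.shape, gate_out.shape, task_out.shape)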
3. TensorFlow implementation
A TensorFlow version is available at https://github.com/crediks/MMoE.
Here, we can simply translate the walkthrough from Section 2 into TensorFlow code.
import tensorflow as tf  # TF 1.x

class MMoE(object):
    def __init__(self, hidden_size, num_experts, num_tasks):
        self.hidden_size = hidden_size
        self.num_experts = num_experts
        self.num_tasks = num_tasks

    def get_output(self, inputs):
        # Xavier/Glorot initialization for all parameters (TF 1.x API)
        xavier_init = tf.glorot_uniform_initializer()
        input_dimension = int(inputs.get_shape()[-1])

        # Expert weights: [input_dimension, hidden_size, num_experts]
        expert_weight = tf.get_variable(name='expert_weight', initializer=xavier_init,
                                        shape=[input_dimension, self.hidden_size, self.num_experts])
        expert_bias = tf.get_variable(name='expert_bias', initializer=xavier_init,
                                      shape=[self.hidden_size, self.num_experts])
        # [batch_size, hidden_size, num_experts]
        expert_output = tf.tensordot(inputs, expert_weight, axes=1) + expert_bias
        expert_output = tf.nn.relu(expert_output, name='expert_output')

        # Gate weights: [input_dimension, num_experts, num_tasks]
        gate_weight = tf.get_variable(name='gate_weight', initializer=xavier_init,
                                      shape=[input_dimension, self.num_experts, self.num_tasks])
        gate_bias = tf.get_variable(name='gate_bias', initializer=xavier_init,
                                    shape=[self.num_experts, self.num_tasks])
        # [batch_size, num_experts, num_tasks]
        gate_output = tf.tensordot(inputs, gate_weight, axes=1) + gate_bias
        # num_tasks tensors of shape [batch_size, num_experts, 1]
        gate_outputs = tf.split(gate_output, num_or_size_splits=self.num_tasks, axis=2)

        final_outputs = []
        for gate_output in gate_outputs:
            # [batch_size, 1, num_experts]
            gate_output = tf.transpose(gate_output, [0, 2, 1])
            # Softmax over the expert dimension
            gate_output = tf.nn.softmax(gate_output, name='gate_output_softmax')
            # [batch_size, hidden_size, num_experts]
            gate_output = tf.tile(gate_output, [1, self.hidden_size, 1])
            weighted_expert_output = expert_output * gate_output
            # [batch_size, hidden_size]
            final_outputs.append(tf.reduce_sum(weighted_expert_output, axis=2))
        return final_outputs
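And a minimal sketch of driving this class in a TF 1.x graph; the placeholder dimension and the hyperparameters below are illustrative:

import numpy as np
import tensorflow as tf  # TF 1.x

inputs = tf.placeholder(tf.float32, shape=[None, 100], name='inputs')
mmoe = MMoE(hidden_size=16, num_experts=8, num_tasks=2)
# List of num_tasks tensors, each of shape [batch_size, hidden_size]
task_outputs = mmoe.get_output(inputs)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    outs = sess.run(task_outputs, feed_dict={inputs: np.random.randn(4, 100)})
    print([o.shape for o in outs])  # [(4, 16), (4, 16)]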
If you have any questions, feel free to leave a comment below and discuss them with me!