deep_learning 04. attention

Latest recommended article published 2020-05-20 09:27:58

adowu

Category: Models. Tags: attention, rnn

Article link: https://blog.csdn.net/WUUUSHAO/article/details/88175373


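A few words to start:
Start from the basics, keep learning, persevere, and keep going.
From a programmer from Mars who loves life and technology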

RNN series

BasicRNNCell
BasicLSTMCell
MultiRNNCell

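No discussion of RNNs is complete without attention, so this post looks at how attention is applied on top of an RNN. The motivation for attention is not covered here.

Straight to the figure (an image with no watermark that is also centered, finally).

If you have read the earlier posts in this series, the part of the figure below the match block needs no further explanation.
$h_t$ is the output at each time step.
$c_0$ is the final state output.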

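In BahdanauAttention:
$u^t_l = v^T \tanh(W_1 h_l + W_2 d_t)$
$a^t = \mathrm{softmax}(u^t)$
$c^t = \sum_{l=1}^{L} a^t_l h_l$

$h_l$ is the output at time step $l$, for $l = 1, \dots, L$
$d_t$ is the decoder state at decoding step $t$
$v^T$ is a weight vector that has to be learned, as are the matrices $W_1$ and $W_2$
The rest is fairly self-explanatory.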
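
To make the three formulas concrete, here is a minimal NumPy sketch of one additive attention step. The function name, the sizes and the random parameters are illustrative assumptions and do not come from the original demo. The sketch also uses a row-vector convention, so `h @ W1` plays the role of $W_1 h_l$.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(h, d_t, W1, W2, v):
    """
    h:   [L, H]  encoder outputs h_1..h_L
    d_t: [D]     decoder state at step t
    W1:  [H, A], W2: [D, A], v: [A]  (learned in practice, random here)
    """
    # u_l = v^T tanh(W1 h_l + W2 d_t): one score per encoder step -> [L]
    u = np.tanh(h @ W1 + d_t @ W2) @ v
    a = softmax(u)      # [L], attention weights that sum to 1
    c = a @ h           # [H], context vector sum_l a_l h_l
    return c, a

rng = np.random.default_rng(0)
L, H, D, A = 5, 4, 3, 6             # illustrative sizes
h = rng.normal(size=(L, H))
d_t = rng.normal(size=D)
W1 = rng.normal(size=(H, A))
W2 = rng.normal(size=(D, A))
v = rng.normal(size=A)

c, a = additive_attention(h, d_t, W1, W2, v)
print(a.shape, c.shape)             # (5,) (4,)
```

The decoder state is broadcast over all $L$ encoder steps, which is why one call scores every $h_l$ against the same $d_t$.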

In a classification task there is no decoder.

The demo below, together with the figure above, walks through the process step by step.

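import tensorflow as tf  # TensorFlow 1.x API (tf.layers, tf.get_variable)


def attention(inputs, hidden_size, dropout, attention_size):
    """
    :param inputs: [B, T, D] -> [batch_size, sequence_length, embedding_size]
    :param hidden_size: RNN output size
    :param dropout: dropout rate
    :param attention_size: attention output size
    :return: (C, A), with C [batch_size, 2 * hidden_size] the attention-weighted
        sum of the RNN outputs and A [batch_size, sequence_length] the weights
    """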
    fw = tf.nn.rnn_cell.GRUCell(hidden_size, name='fw')
    bw = tf.nn.rnn_cell.GRUCell(hidden_size, name='bw')

    if dropout:
        # DropoutWrapper takes a keep probability, so convert the dropout rate.
        fw = tf.nn.rnn_cell.DropoutWrapper(fw, output_keep_prob=1.0 - dropout)
        bw = tf.nn.rnn_cell.DropoutWrapper(bw, output_keep_prob=1.0 - dropout)

    output, _ = tf.nn.bidirectional_dynamic_rnn(
        fw,
        bw,
        inputs=inputs,
        dtype=tf.float32
    )

    #   [batch_size, sequence_length, 2 * hidden_size]
    output = tf.concat(output, axis=2)

    #   tanh(W * X + B), applied along the last axis
    #   [batch_size, sequence_length, 2 * hidden_size] -> [batch_size, sequence_length, attention_size]
    I = tf.layers.dense(inputs=output, units=attention_size, activation=tf.tanh)

    V = tf.get_variable(name='v_omega', shape=[attention_size], dtype=tf.float32)

    #   [batch_size, sequence_length, attention_size]
    U = tf.multiply(I, V)
    #   [batch_size, sequence_length]
    U = tf.reduce_sum(U, axis=2)
    #   [batch_size, sequence_length]
    A = tf.nn.softmax(U, axis=1)

    #   multiply is [batch_size, sequence_length, 2 * hidden_size] * [batch_size, sequence_length, 1]
    #   multiply = [batch_size, sequence_length, 2 * hidden_size]
    #   reduce_sum = [batch_size, 2 * hidden_size]
    C = tf.reduce_sum(tf.multiply(output, tf.expand_dims(A, -1)), axis=1)
    return C, A

Once we have the RNN result, output, we have the $h$ from the formulas above.

I = tf.layers.dense(inputs=output, units=attention_size, activation=tf.tanh)

The line above is the following term of the formula:
$\tanh(W_1 h_l + W_2 d_t)$
Since there is no $d_t$ here, it becomes:
$\tanh(W_1 h_l + b)$

    V = tf.get_variable(name='v_omega', shape=[attention_size], dtype=tf.float32)

    #   [batch_size, sequence_length, attention_size]
    U = tf.multiply(I, V)

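The two lines above, together with the tf.reduce_sum(U, axis=2) call that follows, give us $u^t$, which is the result after match in the figure.
$u^t_l = v^T \tanh(W_1 h_l + b)$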
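
An elementwise multiply by the vector followed by a sum over the last axis is a dot product with $v$ at every time step. The small NumPy check below illustrates that equivalence. The shapes and random values are stand-ins chosen for illustration, not the real tensors from the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, A = 2, 4, 3                         # illustrative batch, sequence, attention sizes
I = np.tanh(rng.normal(size=(B, T, A)))   # stands in for tanh(W_1 h + b)
V = rng.normal(size=A)                    # stands in for the learned vector v

# Same two steps as the TensorFlow code: broadcast multiply, then sum over axis 2.
U_two_step = np.sum(I * V, axis=2)        # [B, T]

# The same scores written directly as v^T tanh(...) for every time step.
U_dot = I @ V                             # [B, T]

print(U_two_step.shape, np.allclose(U_two_step, U_dot))
```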

    #   [batch_size, sequence_length]
    U = tf.reduce_sum(U, axis=2)
    #   [batch_size, sequence_length]
    A = tf.nn.softmax(U, axis=1)

After the softmax we have the attention weights, $S_t$ in the figure, corresponding to this formula:
$a^t = \mathrm{softmax}(u^t)$
The last step is to weight $h$ by the attention weights and sum.

    #   multiply is [batch_size, sequence_length, 2 * hidden_size] * [batch_size, sequence_length, 1]
    #   multiply = [batch_size, sequence_length, 2 * hidden_size]
    #   reduce_sum = [batch_size, 2 * hidden_size]
    C = tf.reduce_sum(tf.multiply(output, tf.expand_dims(A, -1)), axis=1)
    # C = tf.reduce_mean(tf.multiply(output, tf.expand_dims(A, -1)), axis=1)

This corresponds to the formula:
$c^t = \sum_{l=1}^{L} a^t_l h_l$
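
A tiny worked example of the weighted sum, using hand-picked numbers that are purely illustrative. It also shows how the reduce_mean alternative differs: it is the same weighted sum divided by the sequence length.

```python
import numpy as np

# One sequence (B=1) with T=3 time steps and feature size 2.
output = np.array([[[1.0, 0.0],
                    [0.0, 1.0],
                    [2.0, 2.0]]])         # [B, T, D]
A = np.array([[0.5, 0.25, 0.25]])         # [B, T], weights already sum to 1

weighted = output * A[..., np.newaxis]    # [B, T, D], each h_l scaled by its weight
C_sum = weighted.sum(axis=1)              # [B, D]
C_mean = weighted.mean(axis=1)            # [B, D], equals C_sum / T

# 0.5*[1, 0] + 0.25*[0, 1] + 0.25*[2, 2] = [1.0, 0.75]
print(C_sum)
print(C_mean)
```

Because the weights already sum to 1, the sum is a proper weighted average of the time steps, while the mean shrinks it by a further factor of $T$.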

The final $C$ is the vector representation of one input, in which each input $x_t$ contributes with a different weight.
That completes the demo of attention for classification.
One last remark. In an encoder-decoder we can take the RNN outputs, $h_t$ and $c_0$ in the figure, and compute attention from $h_t$ and $c_0$ ($c_0$ is also the decoder's initial state). That gives $C$, which is $X_{d-1}$ in the figure below. When decoding, $X_{d-1}$ and $c_0$ are fed to the first decoding step, which yields $c_1$. Then $c_1$ and $h_t$ give the input for the next step, and this repeats until the output ends.
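
The decoding loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tanh update is a hypothetical stand-in for a real decoder cell, the parameter names in the dictionary are invented, and all weights are random rather than trained.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(h, d, W1, W2, v):
    """Context vector and weights for decoder state d over encoder outputs h [L, H]."""
    a = softmax(np.tanh(h @ W1 + d @ W2) @ v)   # [L]
    return a @ h, a                              # [H], [L]

def decode(h, c0, p, steps):
    """Run `steps` decoder steps, recomputing attention from the current state."""
    state, states, weights = c0, [], []
    for _ in range(steps):
        context, a = additive_attention(h, state, p["W1"], p["W2"], p["v"])
        # Hypothetical simple cell: next state from the context and the old state.
        state = np.tanh(context @ p["Wc"] + state @ p["Ws"])
        states.append(state)
        weights.append(a)
    return np.stack(states), np.stack(weights)

rng = np.random.default_rng(0)
L, H, D, A = 5, 4, 4, 6                  # illustrative sizes
p = {
    "W1": rng.normal(size=(H, A)), "W2": rng.normal(size=(D, A)),
    "v": rng.normal(size=A),
    "Wc": rng.normal(size=(H, D)), "Ws": rng.normal(size=(D, D)),
}
h = rng.normal(size=(L, H))              # encoder outputs h_t
c0 = rng.normal(size=D)                  # final encoder state = initial decoder state
states, weights = decode(h, c0, p, steps=3)
print(states.shape, weights.shape)       # (3, 4) (3, 5)
```

Each row of `weights` is a separate distribution over the encoder steps, since the attention is recomputed from the updated decoder state at every step.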

The figure should help make this clearer:

At this point the attention is much the same as in BahdanauAttention.

Attention has many variants, of course, and they differ mainly in the match step.
In LuongAttention, the match operation is:
$u^t_l = d_t^T W h_l$
$a^t = \mathrm{softmax}(u^t)$
$c^t = \sum_{l=1}^{L} a^t_l h_l$

These two kinds of attention are what we usually call additive attention and multiplicative attention.
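
For comparison with the additive form, here is a minimal NumPy sketch of the multiplicative score $d_t^T W h_l$. The function name, shapes and random values are assumptions made for illustration. Only the match step changes; the softmax and the weighted sum are the same as before.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def multiplicative_attention(h, d_t, W):
    """
    h:   [L, H]  encoder outputs
    d_t: [D]     decoder state at step t
    W:   [D, H]  (learned in practice, random here)
    """
    u = h @ (W.T @ d_t)   # [L], u_l = d_t^T W h_l for every encoder step
    a = softmax(u)        # [L]
    c = a @ h             # [H]
    return c, a

rng = np.random.default_rng(0)
L, H, D = 5, 4, 3                       # illustrative sizes
h = rng.normal(size=(L, H))
d_t = rng.normal(size=D)
W = rng.normal(size=(D, H))

c, a = multiplicative_attention(h, d_t, W)
print(a.shape, c.shape)                 # (5,) (4,)
```

The multiplicative score needs a single matrix and no tanh, so it is cheaper to compute than the additive score.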

This example uses global attention, meaning attention is computed over all inputs $X_t$. There is also local attention, which computes attention only inside a window around an aligned position rather than over the whole sequence, and so reduces the amount of computation. The difference is whether all encoder states or only part of them are attended to.
See LuongAttention for the details.
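
A simplified sketch of the windowing idea behind local attention follows. It assumes the scores are already computed and the window center is simply given. How the center is chosen, and any further reweighting inside the window, is left out. The function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def local_attention(h, u, center, half_width):
    """
    h: [L, H] encoder outputs, u: [L] match scores.
    Only positions in [center - half_width, center + half_width] get weight.
    """
    L = h.shape[0]
    lo = max(0, center - half_width)
    hi = min(L, center + half_width + 1)
    a = np.zeros(L)
    a[lo:hi] = softmax(u[lo:hi])   # normalize inside the window only
    return a @ h, a

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4))        # illustrative encoder outputs
u = rng.normal(size=8)             # illustrative scores
c, a = local_attention(h, u, center=3, half_width=1)
print(np.nonzero(a)[0])            # only positions 2, 3 and 4 receive weight
```

Global attention is the special case where the window covers the whole sequence.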