Attention Basics + a Simple Code Implementation + Transformer

Latest recommended article published on 2024-06-23 10:16:53

MoonLer

Category: deeplearning

Article link: https://blog.csdn.net/qq_40240102/article/details/96872780

My theory-learning path

The items below are listed in a deliberate order.

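[1] The introductory article I consider the most classic:

https://zhuanlan.zhihu.com/p/37601161

[2] Attention computation in somewhat more detail (without being too mathematical):

https://www.jianshu.com/p/c94909b835d6

(The two articles above are enough for the basics. The ones below are background reading that I only skimmed.)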

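[3] "RNN and LSTM are weak! Attention models are the way to go" (an interesting angle, but judge its stance for yourself)

https://www.jiqizhixin.com/articles/2018-05-03-8?from=synced&keyword=注意力模型

[4] A quick overview of hierarchical attention networks for text classification

https://www.jiqizhixin.com/articles/2019-06-04-3?from=synced&keyword=注意力模型

[5] NAACL 2019 paper, a distinctive perspective | Correcting the attribution fallacy: attention does not explain the model

https://www.jiqizhixin.com/articles/2019-06-11-4?from=synced&keyword=注意力模型

[6] A look at the power of deep learning in NLP through various attention mechanisms (popular-science piece)

https://www.jiqizhixin.com/articles/2018-10-08-12

[7] Extensions of attention (categories and variants; it feels somewhat disorganized to me, but it is quite comprehensive)

https://zhuanlan.zhihu.com/p/31547842

[8] A survey of attention model methods | Readings of several classic papers (I had no need for it, so I did not read it closely)

https://www.jiqizhixin.com/articles/2018-06-11-16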

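Code

About the code:

1. A simple bidirectional GRU + attention + fully connected layer performs binary sentiment classification.

(I mainly wanted to get familiar with attention code.)

2. After studying it, I realized the code is not a standard encoder-decoder network, i.e. a seq2seq network, so its attention code is frankly a bit odd and extremely simple, though it conveys the idea. (Update: my Transformer study code and materials are linked further down.)

3. The overall logic of the code is clear and the comments are decent, so it seems usable as a base template for TensorFlow projects.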

Single-layer bidirectional GRU

import tensorflow as tf  # TensorFlow 1.x
from tensorflow.contrib.rnn import GRUCell
from tensorflow.python.ops.rnn import bidirectional_dynamic_rnn as bi_rnn

# (Bi-)RNN layer(-s)
# bi_rnn returns a tuple (outputs, output_states), where outputs is itself
# a tuple (output_fw, output_bw) holding the forward and backward outputs.
rnn_outputs, _ = bi_rnn(GRUCell(HIDDEN_SIZE), GRUCell(HIDDEN_SIZE),
                        inputs=batch_embedded, sequence_length=seq_len_ph, dtype=tf.float32)
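
Because `rnn_outputs` is a pair of tensors, it has to be merged into a single tensor before attention can be applied. The excerpt does not show that step, so the sketch below assumes the common choice of concatenating along the feature axis, illustrated with NumPy arrays instead of TensorFlow tensors. The sizes are arbitrary illustration values.

```python
import numpy as np

B, T, HIDDEN_SIZE = 2, 5, 3  # illustrative sizes, not taken from the project

# Stand-ins for the forward and backward RNN outputs, each of shape (B, T, HIDDEN_SIZE).
output_fw = np.zeros((B, T, HIDDEN_SIZE))
output_bw = np.ones((B, T, HIDDEN_SIZE))
rnn_outputs = (output_fw, output_bw)

# Assumed merge step: concatenate on the last (feature) axis,
# so the attention input has D = 2 * HIDDEN_SIZE features per time step.
merged = np.concatenate(rnn_outputs, axis=2)
print(merged.shape)  # (2, 5, 6)
```

In TensorFlow 1.x the same operation would be `tf.concat(rnn_outputs, 2)`.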

Attention

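1. The formula below is a rearranged form of the attention formula used in the code.

2. The hard part of the code below is the multi-dimensional matrix multiplication. The logic is easy to follow; the key is tracking how the dimensions change.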

[Figure: the attention formula; image not preserved]
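
Read directly off the code in this section, the attention it computes can be written as follows. This is a reconstruction from the code, not necessarily the form shown in the original figure. Here $h_t \in \mathbb{R}^D$ is the RNN output at time step $t$, $W_\omega \in \mathbb{R}^{D \times A}$, $b_\omega, u_\omega \in \mathbb{R}^A$, and $o$ is the pooled output (my own symbol for `output`).

```latex
v_t = \tanh\!\left(W_\omega^{\top} h_t + b_\omega\right), \qquad
\alpha_t = \frac{\exp\!\left(u_\omega^{\top} v_t\right)}{\sum_{s=1}^{T} \exp\!\left(u_\omega^{\top} v_s\right)}, \qquad
o = \sum_{t=1}^{T} \alpha_t \, h_t
```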

    # Excerpt from the body of the attention function: `inputs` has shape (B,T,D)
    # (batch, time, hidden) and `hidden_size` is D.

    # Trainable parameters
    w_omega = tf.Variable(tf.random_normal([hidden_size, attention_size], stddev=0.1))
    b_omega = tf.Variable(tf.random_normal([attention_size], stddev=0.1))
    u_omega = tf.Variable(tf.random_normal([attention_size], stddev=0.1))

    with tf.name_scope('v'):
        # Applying fully connected layer with non-linear activation to each of the B*T timestamps;
        #  the shape of `v` is (B,T,D)*(D,A)=(B,T,A), where A=attention_size
        v = tf.tanh(tf.tensordot(inputs, w_omega, axes=1) + b_omega)

    # For each of the timestamps its vector of size A from `v` is reduced with `u` vector
    vu = tf.tensordot(v, u_omega, axes=1, name='vu')  # (B,T) shape
    alphas = tf.nn.softmax(vu, name='alphas')         # (B,T) shape (batch,time)

    # Output of (Bi-)RNN is reduced with attention vector; the result has (B,D) shape
    # Note: `*` here is an element-wise product, with alphas broadcast over the D axis
    output = tf.reduce_sum(inputs * tf.expand_dims(alphas, -1), 1)

    if not return_alphas:
        return output
    else:
        return output, alphas
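
To make the dimension changes easier to follow, here is a minimal NumPy sketch of the same computation. It is an illustration rather than the project's code: the function name `attention_pool`, the sizes, and the random seed are my own assumptions, while the parameter names mirror the excerpt.

```python
import numpy as np

def attention_pool(inputs, w_omega, b_omega, u_omega):
    """Attention pooling over the time axis.

    inputs:  (B, T, D) RNN outputs
    w_omega: (D, A), b_omega: (A,), u_omega: (A,)
    Returns (output, alphas) with shapes (B, D) and (B, T).
    """
    # (B,T,D) x (D,A) -> (B,T,A): axes=1 contracts the last axis of
    # `inputs` with the first axis of `w_omega`.
    v = np.tanh(np.tensordot(inputs, w_omega, axes=1) + b_omega)
    # (B,T,A) x (A,) -> (B,T): one scalar score per time step.
    vu = np.tensordot(v, u_omega, axes=1)
    # Softmax over the time axis (max subtracted for numerical stability).
    e = np.exp(vu - vu.max(axis=-1, keepdims=True))
    alphas = e / e.sum(axis=-1, keepdims=True)
    # (B,T,D) * (B,T,1) broadcasts element-wise; summing over T gives (B,D).
    output = np.sum(inputs * alphas[..., None], axis=1)
    return output, alphas

rng = np.random.default_rng(0)
B, T, D, A = 2, 5, 8, 4  # arbitrary sizes chosen for illustration
inputs = rng.normal(size=(B, T, D))
w_omega = rng.normal(scale=0.1, size=(D, A))
b_omega = rng.normal(scale=0.1, size=A)
u_omega = rng.normal(scale=0.1, size=A)

output, alphas = attention_pool(inputs, w_omega, b_omega, u_omega)
print(output.shape, alphas.shape)  # (2, 8) (2, 5)
```

As in the excerpt, the softmax here runs over all T positions, so padded time steps also receive some weight. If `u_omega` is all zeros, every score is equal and the output reduces to the plain mean over time.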

Interesting usages

This transpose should come in handy:

    if time_major:
        # (T,B,D) => (B,T,D)
        inputs = tf.transpose(inputs, [1, 0, 2])
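
The effect of that permutation can be checked on a small array. The sketch below uses NumPy in place of TensorFlow, with illustrative sizes that are not from the project.

```python
import numpy as np

T, B, D = 4, 2, 3  # illustrative sizes
time_major = np.arange(T * B * D).reshape(T, B, D)  # (T, B, D)

# Swap the first two axes and keep the feature axis: (T,B,D) => (B,T,D).
batch_major = np.transpose(time_major, (1, 0, 2))
print(batch_major.shape)  # (2, 4, 3)

# The feature vector of example b at step t is unchanged, only re-indexed.
print(np.array_equal(batch_major[1, 3], time_major[3, 1]))  # True
```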

Embedding layer trained together with the model (trainable):

# Embedding layer
with tf.name_scope('Embedding_layer'):
    
    embeddings_var = tf.Variable(tf.random_uniform([vocabulary_size, EMBEDDING_DIM], -1.0, 1.0), trainable=True)
    tf.summary.histogram('embeddings_var', embeddings_var)
    batch_embedded = tf.nn.embedding_lookup(embeddings_var, batch_ph)
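
Conceptually, looking up a single embedding matrix amounts to selecting its rows by token id. The NumPy sketch below shows the resulting shapes; the sizes and the id batch are made-up illustration values.

```python
import numpy as np

vocabulary_size, EMBEDDING_DIM = 6, 3  # illustrative sizes
# Row i is the vector for token id i.
embeddings = np.arange(vocabulary_size * EMBEDDING_DIM, dtype=np.float32).reshape(
    vocabulary_size, EMBEDDING_DIM)

# A batch of token ids with shape (B, T) = (2, 3).
batch = np.array([[0, 2, 5],
                  [1, 1, 4]])

# Integer-array indexing picks one row per id: (B, T) -> (B, T, EMBEDDING_DIM).
batch_embedded = embeddings[batch]
print(batch_embedded.shape)  # (2, 3, 3)
```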

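Embedding layer loaded statically from pretrained word vectors (I have used this before; added here for comparison):

(Constant: stores values that do not change. When the graph is built and the initializer is called, the values are saved directly in the graph. Note that tf.get_variable creates a trainable variable by default, so the vectors are still updated during training unless trainable=False is passed.)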

self.embedding = tf.get_variable("embeddings",
                                 shape=[self.config.title_vocab_size, self.config.embedding_size],
                                 initializer=tf.constant_initializer(self.config.title_pre_trianing))

tf.squeeze() removes the size-1 axes of a tensor:

y_hat = tf.squeeze(y_hat)
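
One thing to watch for: without an axis argument, squeeze removes every size-1 axis, including a batch axis of size 1. The NumPy sketch below shows the shape rule, which tf.squeeze shares (it also accepts an `axis` argument). The shapes are illustrative assumptions about `y_hat`.

```python
import numpy as np

# Typical case: logits of shape (B, 1) become a vector of shape (B,).
y_hat = np.zeros((4, 1))
print(np.squeeze(y_hat).shape)  # (4,)

# With a batch of one, the batch axis is squeezed away as well.
single = np.zeros((1, 1))
print(np.squeeze(single).shape)  # ()

# Naming the axis keeps the batch dimension.
print(np.squeeze(single, axis=-1).shape)  # (1,)
```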

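Code links

(1) Simple attention for classification:

https://github.com/ilivans/tf-rnn-attention

(2) Transformer study (paper + video notes + paper code)
(I have not read this code closely. I only went through the parts where I was unsure how the paper's network is implemented, and I have not actually trained it. The original project is linked below; it provides a pretrained model, but downloading it requires getting past the firewall.)
https://github.com/MaybeWeCan/Transformer_learn
(3) The original project the reference code comes from:
https://github.com/Kyubyong/transformer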

END

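I plan to work through the RNN family again soon, following the route RNN -> GRU -> LSTM -> Bi-LSTM -> Attention -> self-attention -> Transformer -> BERT. The Transformer certainly deserves a post of its own. However, as with the attention theory section above, I post links instead of writing it up myself: when I learn from a blog, my ceiling can never exceed that blog, so writing yet another version myself feels pointless. That is how this supplementary update came about.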
