seq2seq + attention

自学AI的鲨鱼儿

Tags: deep learning

Article link: https://blog.csdn.net/qq_16555103/article/details/90167432

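1. Questions to think about:
    ① Why does the decoder generally need the same hidden_size as the encoder?

2. Points to note for seq2seq + attention:
    ① If the encoder's RNNCell is an LSTM, the high-level vector C (the recurrent state) it outputs has dimension
      2*hidden_size (cell state c plus hidden state h), whereas for a GRU the vector C has dimension hidden_size.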
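
The state sizes in point ① can be checked with a small framework-free sketch. `state_size` is a hypothetical helper written for this note, not a TensorFlow function. It also hints at the answer to question ①: in the code analysed later, the encoder's final state is handed to the decoder as its initial state, so the two state shapes have to match.

```python
def state_size(cell_type, hidden_size, num_layers=1):
    """Width of the recurrent state when flattened into one vector.

    Hypothetical helper for illustration only (not a TensorFlow API).
    An LSTM layer carries two vectors (cell state c and hidden state h),
    a GRU layer carries only h.
    """
    per_layer = {"lstm": 2 * hidden_size, "gru": hidden_size}[cell_type.lower()]
    return per_layer * num_layers


print(state_size("lstm", 512))     # 1024
print(state_size("gru", 512))      # 512
print(state_size("lstm", 512, 2))  # 2048 for two stacked LSTM layers
```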

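I. Basic structure of seq2seq with multi-layer cells

Application scenarios for seq2seq

1. Machine translation
2. Dialogue bots
3. Automatic document summarization
4. Automatic image captioning

II. The role of attention

III. Analyzing seq2seq + attention through the TensorFlow source code (LSTM as the example)

1. encoder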

    # Encoder.
    encoder_cell = copy.deepcopy(cell)
    encoder_cell = core_rnn_cell.EmbeddingWrapper(
        encoder_cell,
        embedding_classes=num_encoder_symbols,
        embedding_size=embedding_size)
    # Run the encoder, much like feature extraction, to get its outputs and final state
    encoder_outputs, encoder_state = rnn.static_rnn(
        encoder_cell, encoder_inputs, dtype=dtype)

    # First calculate a concatenation of encoder outputs to put attention on.
    top_states = [
        array_ops.reshape(e, [-1, 1, cell.output_size]) for e in encoder_outputs
    ]
    attention_states = array_ops.concat(top_states, 1)
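    # attention_states: [batch_size, attn_length, cell.output_size]
    # (later reshaped to [batch_size, H, W, channel] with W = 1 for the 1-by-1 convolution)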
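
To make the reshape-and-concat step concrete, here is a minimal sketch with plain Python lists instead of tensors. `stack_encoder_outputs` is a hypothetical name. It only mimics the shape bookkeeping: T per-step outputs of shape [batch, size] become one [batch, T, size] block that attention can read.

```python
def stack_encoder_outputs(encoder_outputs):
    """Turn T per-step outputs of shape [batch, size] into [batch, T, size]."""
    batch = len(encoder_outputs[0])
    return [[step[b] for step in encoder_outputs] for b in range(batch)]


# Toy data: 3 time steps, batch of 2, output size 4.
encoder_outputs = [[[float(t)] * 4 for _ in range(2)] for t in range(3)]
attention_states = stack_encoder_outputs(encoder_outputs)

shape = (len(attention_states), len(attention_states[0]), len(attention_states[0][0]))
print(shape)  # (2, 3, 4)
```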

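2. decoder

(1) How the decoder's previous output h_{i-1} is turned into the current input y_{i-1}

1. Notes:
    1.1 This conversion runs only at prediction time. Prediction has no label, so the decoder input == the decoder
        output of the previous time step.
    1.2 During seq2seq training a label exists, so the current decoder input depends on the label instead of the
        previous decoder output. There are two ways to do this:
            ① Pick the input at random, like a coin flip, between the previous decoder output and the true label.
            ② Always feed the true label (this corrects the decoder's outputs during training; the TensorFlow
              source uses this method).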
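
The input choices in 1.1 and 1.2 can be sketched in a few lines. `next_decoder_input` and `teacher_forcing_prob` are illustrative names, not taken from the TensorFlow source: a probability of 1.0 corresponds to option ② (always the true label), anything lower to the coin flip of option ①.

```python
import random


def next_decoder_input(prev_prediction, gold_token, training,
                       teacher_forcing_prob=1.0, rng=random):
    """Pick the token fed to the decoder at the current time step."""
    if not training:
        return prev_prediction           # prediction: no label is available
    if rng.random() < teacher_forcing_prob:
        return gold_token                # feed the true label
    return prev_prediction               # feed the decoder's own previous output


print(next_decoder_input("cat", "dog", training=False))                           # cat
print(next_decoder_input("cat", "dog", training=True))                            # dog
print(next_decoder_input("cat", "dog", training=True, teacher_forcing_prob=0.0))  # cat
```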

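(2) How attention is added to the decoder (TensorFlow)

Note: the structure below is how attention works inside the TensorFlow source. It differs slightly from the soft
attention, hard attention, local attention and self attention structures described in the papers.

https://blog.csdn.net/qq_16555103/article/details/99760588 --- soft attention, hard attention, local attention

Walking through the source

Key point:
    At the time step where the decoder input is GO, the attention vector C is initialized to a zero vector. The zero
        vector and GO are merged and passed through Linear to get the actual input x, then .cell(x, state) is run.
        Later steps are computed in the same way as above.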
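
The GO step described above can be sketched with plain lists. Everything here is a stand-in: `linear` and `toy_cell` are simplified hypothetical helpers and the weights are made up, so this shows only the wiring (merge input with attention vector, project, run the cell), not the real TensorFlow ops.

```python
import math


def linear(vectors, weights, bias):
    """y = W @ concat(vectors) + b, written with plain lists."""
    x = [value for vec in vectors for value in vec]
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def toy_cell(x, state):
    """Stand-in for an RNN cell: new_state = tanh(x + state)."""
    new_state = [math.tanh(a + b) for a, b in zip(x, state)]
    return new_state, new_state


attn_size = 2
go_embedding = [0.5, -0.5]
attn = [0.0] * attn_size  # the attention vector starts as zeros at the GO step

# Made-up weights: identity on the input part, ones on the attention part.
weights = [[1.0, 0.0, 1.0, 1.0],
           [0.0, 1.0, 1.0, 1.0]]
bias = [0.0, 0.0]

x = linear([go_embedding, attn], weights, bias)
cell_output, state = toy_cell(x, [0.0, 0.0])
print(x)  # [0.5, -0.5]: the zero attention vector contributes nothing here
```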

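(3) How TensorFlow computes the attention state used in (2)

Theoretical formulas:

Solve for the weight vector α over the values (in the TensorFlow LSTM code, value vector = key vector = the encoder output ht at each time step)

Solve for the attention vector

https://blog.csdn.net/sdu_hao/article/details/88167962 ------------------- how the score (weight) vector α is computed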
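
The formula images from the original post did not survive, so the following is only a reconstruction: the additive form that matches the operations in the TensorFlow code quoted in this section. The symbols $W_1$, $W_2$, $b$, $v$, $s_t$ and $c_t$ are labels chosen here, with $h_i$ the encoder output at step $i$, $s_t$ the flattened decoder state used as the query, and $L$ the value of attn_length.

```latex
\begin{aligned}
e_{t,i} &= v^{\top} \tanh\left(W_1 h_i + W_2 s_t + b\right), \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j=1}^{L} \exp(e_{t,j})}, \\
c_t &= \sum_{i=1}^{L} \alpha_{t,i}\, h_i .
\end{aligned}
```

In terms of the code, $W_1 h_i$ corresponds to hidden_features (the AttnW variable), $W_2 s_t + b$ to linear(query, attention_vec_size, True), $v$ to AttnV, $\alpha_t$ to the softmax output, and $c_t$ to d.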

How the attention vector is computed, according to the TensorFlow source

================== TensorFlow source code for computing attention
with variable_scope.variable_scope(
      scope or "attention_decoder", dtype=dtype) as scope:
    dtype = scope.dtype

    batch_size = array_ops.shape(decoder_inputs[0])[0]  # Needed for reshaping.
    # attention_states comes from the encoder; attn_length is the sentence length
    attn_length = attention_states.get_shape()[1].value
    if attn_length is None:
      attn_length = array_ops.shape(attention_states)[1]   # e.g. 5, the sentence length
    attn_size = attention_states.get_shape()[2].value      # e.g. 512, the size of each token vector

    # To calculate W1 * h_t we use a 1-by-1 convolution, need to reshape before.
    hidden = array_ops.reshape(attention_states,    
                               [-1, attn_length, 1, attn_size])
    # (?, 5, 1, 512): the 1 is a dummy dimension added for the convolution [batch_size, H, W, Channel]
    hidden_features = []
    v = []
    # attention_vec_size, e.g. 512
    attention_vec_size = attn_size  # Size of query vectors for attention.   k below is U in the slides
    for a in xrange(num_heads):
      k = variable_scope.get_variable("AttnW_%d" % a,
                                      [1, 1, attn_size, attention_vec_size])
      hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
      # hidden_features is U*(h1,h2,...) in the slides; AttnV is v in the slides
      v.append(
          variable_scope.get_variable("AttnV_%d" % a, [attention_vec_size]))

    state = initial_state    # the encoder's final recurrent state is the decoder's initial recurrent state

    def attention(query):     # query is the recurrent state (state); two-layer model here
      """Put attention masks on hidden using hidden_features and query."""
      ds = []  # Results of attention reads will be stored here.
      if nest.is_sequence(query):  # If the query is a tuple, flatten it.
        query_list = nest.flatten(query)
        for q in query_list:  # Check that ndims == 2 if specified.
          ndims = q.get_shape().ndims
          if ndims:
            assert ndims == 2
        query = array_ops.concat(query_list, 1)
      for a in xrange(num_heads):
        with variable_scope.variable_scope("Attention_%d" % a):
          y = linear(query, attention_vec_size, True)             # previous hidden state * W (+ bias)
          y = array_ops.reshape(y, [-1, 1, 1, attention_vec_size])
          # Attention mask is a softmax of v^T * tanh(...).
          s = math_ops.reduce_sum(v[a] * math_ops.tanh(hidden_features[a] + y),
                                  [2, 3])
          a = nn_ops.softmax(s)
          # Now calculate the attention-weighted vector d.
          d = math_ops.reduce_sum(
              array_ops.reshape(a, [-1, attn_length, 1, 1]) * hidden, [1, 2])
          ds.append(array_ops.reshape(d, [-1, attn_size]))      # ds holds the attention vectors (the large box in the diagram)
      return ds
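
For readers who want to step through the arithmetic of attention(query) without TensorFlow, here is a minimal single-example, single-head sketch in plain Python. The function and argument names (`additive_attention`, `w1`, `w2`, `b`) are assumptions made for this illustration. `w1` plays the role of AttnW (a 1-by-1 convolution is just a matrix multiply per position), `w2` and `b` the role of linear(query, ...), and `v` the role of AttnV.

```python
import math


def matvec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(hidden, query, w1, w2, b, v):
    """hidden: [attn_length][attn_size]; query: flattened decoder state."""
    y = [p + bi for p, bi in zip(matvec(w2, query), b)]      # linear(query)
    scores = []
    for h in hidden:
        feat = matvec(w1, h)                                 # W1 * h_i
        scores.append(sum(vk * math.tanh(f + yk)
                          for vk, f, yk in zip(v, feat, y)))
    weights = softmax(scores)                                # attention mask
    context = [sum(a * h[j] for a, h in zip(weights, hidden))
               for j in range(len(hidden[0]))]               # weighted sum of h_i
    return weights, context


identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = additive_attention(
    hidden, query=[0.5, -0.5], w1=identity, w2=identity, b=[0.0, 0.0], v=[1.0, 1.0])
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

With `v` set to all zeros every score is 0, the softmax becomes uniform, and the context vector is simply the mean of the encoder outputs, which is a handy sanity check.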

3. Summary of the seq2seq + attention flow

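test():
1. s2s.py --> create_model()
   1) s2s_model.py --> instantiate the S2SModel object:
        a. Defines two layers of BasicLSTMCell, size 512, and combines them into cell.
        b. Defines the embedding weight matrix and the output weight matrix.
        c. Sizes the placeholders by the largest bucket and appends them to the encoder and decoder input lists.
        d. The decoder's prediction output is targets.
        e. Three key pieces of code: sampled_loss implements the loss function; seq2seq_f contains seq2seq and attention;
           model_with_buckets uses seq2seq_f and sampled_loss inside every bucket.

   2) How the three key pieces run:
        a. model_with_buckets [seq2seq.py, TensorFlow source] runs first and wires in seq2seq_f.
        b. seq2seq_f runs next --> embedding_attention_seq2seq [seq2seq.py]:
             1. Embeds the inputs. The embedding is actually wrapped around the cell and bound to it;
                rnn.static_rnn builds the encoder.
             2. Builds the decoder, binding the fully connected output layer to the last decoder cell.
             --->
        c. embedding_attention_decoder [seq2seq.py]:
             1. Initializes the w variable needed by the decoder embedding.
             2. Maps the decoder's predicted 6865-way output onto the vocabulary (vc), looks up the matching token
                and embeds it as the decoder input for the next time step.
             attention appears --->
        d. attention_decoder [seq2seq.py]:
             1. Runs the decoder from input to predicted output, then uses that output as the input of the next time step.
             2. How the attention mechanism is introduced:
                  1. attn_length is the sentence length; attn_size is the length of each token vector.
                  2. [-1, attn_length, 1, attn_size] are, in order, the number of sentences (number of images),
                     the sentence length (H), a dummy dimension (W), and the vector size (channels).
                  3. hidden is h1, h2, ...
        e. Finally, model_with_buckets [seq2seq.py, TensorFlow source] wires in sampled_loss [s2s_model.py],
           which calls tf.nn.sampled_softmax_loss.
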

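Differences between training and prediction:
    First: the do_decode argument of seq2seq_f is False during training and True during prediction, which sets the boolean feed_previous:
        prediction: loop_function = _extract_argmax_and_embed(embedding, output_projection, update_embedding_for_previous) if feed_previous else None
        training:   loop_function is None, so the decoder is fed the true target rather than its own output from the previous time step.
    Second: training adds a backpropagation (BP) pass.
    Third: the model returns different results:
        if not forward_only:
            return outputs[1], outputs[2], outputs[3:]      # training output: the updated w,b; the gradient of w,b, delta(w,b); the loss
        else:
            return None, outputs[0], outputs[1:]       # prediction output: None, the loss (used to evaluate perplexity at prediction time), the actual outputs of the seq2seq network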
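
What a feed_previous loop function does can be sketched as follows. This is not the TensorFlow `_extract_argmax_and_embed` itself: `make_greedy_loop_function` is a hypothetical stand-in that works on plain lists and assumes the projection weights are stored one row per vocabulary entry.

```python
def make_greedy_loop_function(embedding, output_projection=None):
    """Return a function mapping the previous decoder output to the next input."""
    def loop_function(prev_output):
        logits = prev_output
        if output_projection is not None:
            w, b = output_projection  # assumed layout: w[token][hidden], b[token]
            logits = [sum(wi * x for wi, x in zip(row, prev_output)) + bi
                      for row, bi in zip(w, b)]
        token = max(range(len(logits)), key=lambda i: logits[i])  # argmax
        return embedding[token]       # embed the predicted token
    return loop_function


embedding = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # toy table, 3 tokens
loop_function = make_greedy_loop_function(embedding)
print(loop_function([0.1, 0.9, 0.3]))  # [1.0, 1.0], the embedding of token 1
```

During training the loop function is None, so nothing of this kind runs and the decoder simply consumes the provided targets.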

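IV. Practical tips for seq2seq

1. Handling long sentences with seq2seq
    Trick: reverse the source sentence before feeding it to the Encoder. For example, the source sentence "A B C" is
           fed to the Encoder as "C B A". This preprocessing gave a large improvement.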
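
As a preprocessing step, the trick is a one-liner. The sketch below assumes the sentence is already tokenized; only the source side is reversed, the target stays in its normal order.

```python
def reverse_source(tokens):
    """Reverse the encoder input; the decoder target is left untouched."""
    return tokens[::-1]


source = ["A", "B", "C"]
print(reverse_source(source))  # ['C', 'B', 'A']
```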

V. Optimizing a chat system based on seq2seq + attention

bug fix:   program bugs
 1. /->tf.div  (do not use / directly for division in TensorFlow);    *->tf.mul
 
 change and optimize:
 1.gradient :    adamoptimizer(*)                                         seq2seq_model
 2.modify word segmentation                                         
 3.learning_rate_decay_op                                                 seq2seq_model
 
 add:      optimizations
 1.switch cpu&gpu by the global_-->DEV_FLAG                               global_
 2.add dropout just for input;maybe you like drop output and change       seq2seq_model
 3.add epoch                                                              train(lib)
 4.add Chinese chat                                                       _*_
 5.add tensorboard loss-show                                              seq2seq_model
 6.add the 5th min loss point                                             lib/train for model   # save the 5 lowest-loss points (local optima) as the loss falls
 7.add stop early criteria                                                lib/train      # early_stop
 8.add L2 regularization                                                  seq2seq_enhance linear_function_enhance
 9.add the current loss point                                             --for training breakpoint
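
Items 6 and 7 of the list above (keep the five lowest-loss points, stop early) can be combined in one small bookkeeping class. This is a sketch under assumptions: `LossTracker`, `keep` and `patience` are names invented here, and the stopping rule (no new best loss for `patience` evaluations in a row) is one common choice, not necessarily the one the author used.

```python
import heapq


class LossTracker:
    """Remember the `keep` lowest losses and signal when to stop early."""

    def __init__(self, keep=5, patience=3):
        self.keep = keep
        self.patience = patience
        self.best = []                 # sorted list of (loss, step)
        self.best_loss = float("inf")
        self.bad_evals = 0

    def update(self, step, loss):
        """Record one evaluation; return True when training should stop."""
        self.best = heapq.nsmallest(self.keep, self.best + [(loss, step)])
        if loss < self.best_loss:
            self.best_loss = loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


tracker = LossTracker(keep=5, patience=3)
stopped_at = None
for step, loss in enumerate([3.0, 2.0, 2.5, 2.4, 2.3, 2.2]):
    if tracker.update(step, loss):
        stopped_at = step
        break
print(stopped_at, tracker.best[0])  # 4 (2.0, 1)
```

In a real training loop, each entry kept in `best` would correspond to a saved checkpoint.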

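------------------------------------------------------------------ Common optimizations ------------------------------------------------------------------

 ********** 10.add stop_word                                                         _*_                             Goal: improve model quality
          "呀 吗 吧 呢 呵 呃 呕 呗 呜 哎 唉 啊 啦 的 得 地 你 了 ， ？ ！ ! ? 、 。 , ~ ."
        Reason: seq2seq is essentially a probabilistic model, so it leans toward high-frequency tokens. Modal particles, pronouns, punctuation and the like carry little meaning in QA dialogue but occur very often, so they should be treated as stop words.
        Note: stop words are normally removed from the Q sequence only, never from the A sequence. They therefore cannot be handled inside the vocabulary (VC) table; match them against the sequence in memory and strip them before the sequence is fed to the network.
        tip: the stop-word table is a dictionary/trie; dictionary keys are hashed, so lookup is fast.
        eg:    Q  你明天还会来吗？                                        Q  明天还会来
               A  当然呀！                         >>>>>>>>>              A  当然呀！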

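*********** 11. stride ----> 2, 3 [convolution kernel stride, frame skipping (speech only)]                       Goal: speed up the network    dfsmn
12. Kernel design [3*5 ----> 11*21, symmetric kernels changed to asymmetric]                         Goal: make the model more robust to noise
13. Number of layers [VGG variants]
14. Network structure [plain cnn ---> inception/resnet structure]                                    may also just be hyperparameter tuning
15. Bilstm ----> row conv    # idea: replace the bidirectional LSTM with row convolution to speed up the model                    deepspeech2
*********** 16. Embeddings of jieba-segmented words work better than character embeddings
*********** 17. Add and tune dropout
*********** 18. sorted words cut                                             sort words by frequency
        optimizing:
        1.cancel the redundant punctuation and right side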
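
Item 10 of the list above (strip stop words from Q only, in memory, before the sequence reaches the network) can be sketched like this. A Python `set` is used as the hash-based lookup table; the helper name `strip_stop_words` is made up for this example, and the filter works character by character, which fits the single-character stop words listed in the post.

```python
STOP_WORDS = set(
    "呀 吗 吧 呢 呵 呃 呕 呗 呜 哎 唉 啊 啦 的 得 地 你 了 ， ？ ！ ! ? 、 。 , ~ .".split()
)


def strip_stop_words(question):
    """Drop stop characters from a question; answers are never passed through this."""
    return "".join(ch for ch in question if ch not in STOP_WORDS)


q, a = "你明天还会来吗？", "当然呀！"
print(strip_stop_words(q), "|", a)  # 明天还会来 | 当然呀！
```

Because the answer side is left alone, the model can still learn to produce particles and punctuation in its replies.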
 

 debug:   hyperparameter tuning
 1. unit_size 2048    batch_size  256                                     OOM (the program kept running and never terminated)
 2. batch_size (256->64) unit_size(2048->256)              
 3. lr(0.5->0.1) batch_size(64->10) unit_size(256->300) (split("[]/'?.")->split(" "))
 4. lr(0.1->0.001) batch_size(10->32) unit_size(300->500)                 En--overfitting
 5. lr(0.001) min_lr(0.00001) batch_size(10) unit_size(100)               Ch--fine
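
Row 5 pairs a learning rate with a floor (min_lr). One plausible way to combine the two, shown here only as an assumption about how a decay op and a floor might interact, is to multiply by a decay factor and clamp at the minimum. The function name and the `decay_factor` default are illustrative.

```python
def decay_learning_rate(lr, decay_factor=0.99, min_lr=1e-5):
    """Multiplicative decay that never drops below min_lr."""
    return max(lr * decay_factor, min_lr)


lr = 0.001
for _ in range(3):
    lr = decay_learning_rate(lr, decay_factor=0.5)
print(lr)
```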

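Corpus preprocessing:
0. Crawling                 --------- sources: ① WeChat/QQ chats ② Weibo conversations ③ Baidu Zhidao, Zhihu ..... ④ film dialogue, stage-play dialogue
*********** 1. Deduplication             ---------- QA pairs that repeat or serve the same purpose, e.g. 你好 and 你好！ and so on                Method: look it up on Baidu
*********** 2. Cleaning: banned words and long sentences     ---------- banned-word lists are available online (if a QA pair contains a banned word, delete it)
3. Manual proofreading      ----------- established companies have dedicated in-house proofreaders
*********** 4. Data augmentation         ------------- imitate real conversations, covering both common real answers and 'artistic', more emotional answers
        Reason: the crawled data comes from different sources and different conversational settings, so augment according to the business need.
                The corpus may contain similar Qs whose answers look abrupt, because those answers depended on earlier dialogue context.
                At prediction time the model should give a normal answer with high probability and an 'artistic' answer with low probability,
                so the QA pairs are augmented as shown below:

                Common dialogue                                      Uncommon stage-play dialogue
                Q      你好                                           Q      你好
                A      你好                                           A      你怎么来这么早
                Augment: 30x (add 30 copies to the corpus, inserted   Augment: 5x (5/10x)
                at random positions, never consecutively)
        tip: 'artistic' answers are used so that the model seems less 'wooden' and more human.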
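
The augmentation rule above (add N copies of a QA pair at random positions, never next to each other) can be sketched as follows. `augment` is a hypothetical helper; the retry loop and the fixed seed are choices made for this example so that the result is reproducible.

```python
import random


def augment(corpus, pair, copies, rng=None, max_tries=1000):
    """Insert `copies` extra copies of `pair` at random, non-adjacent positions."""
    rng = rng or random.Random(0)
    result = list(corpus)
    for _ in range(copies):
        for _ in range(max_tries):
            pos = rng.randint(0, len(result))
            before = result[pos - 1] if pos > 0 else None
            after = result[pos] if pos < len(result) else None
            if before != pair and after != pair:
                result.insert(pos, pair)
                break
        else:
            raise ValueError("corpus too small to keep the copies apart")
    return result


corpus = [("q%d" % i, "a%d" % i) for i in range(100)]  # placeholder QA pairs
common = ("你好", "你好")
augmented = augment(corpus, common, copies=30)
print(len(augmented), augmented.count(common))  # 130 30
```

The uncommon pair (Q 你好 / A 你怎么来这么早) would go through the same call with copies=5, which keeps it in the corpus at a much lower frequency than the common answer.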
