Understanding RNNs for Seq2Seq Tasks and the Attention Mechanism

from IPython.display import Image
%matplotlib inline

6.5 Building a character-level language model with TensorFlow

In the model we are about to build, the input is a text document, and our goal is to develop a model that can generate new text similar in style to the input document. Examples of such input are a book, or a computer program in a particular programming language. In character-level language modeling, the input is broken down into a sequence of characters that are fed into the network one character at a time. The network processes each new character together with its memory of the previously seen characters to predict the next one. The following figure shows an example of character-level language modeling (note that EOS stands for "End of Sequence"):

Image(filename='images/16_11.png', width=700)


The implementation can be broken down into three separate steps:

  • preparing the data;

  • building the RNN model;

  • and performing next-character prediction and sampling to generate new text.

6.5.1 Data preprocessing

! curl -O http://www.gutenberg.org/files/1268/1268-0.txt
import numpy as np


## Reading and processing text
with open('1268-0.txt', 'r', encoding='utf-8') as fp:
    text=fp.read()
    
start_indx = text.find('THE MYSTERIOUS ISLAND')
end_indx = text.find('End of the Project Gutenberg')
print(start_indx, end_indx)

text = text[start_indx:end_indx]
char_set = set(text)
print('Total Length:', len(text))
print('Unique Characters:', len(char_set))
567 1112917
Total Length: 1112350
Unique Characters: 80
# The figure below shows an example of encoding the strings "Hello" and "world"
Image(filename='images/16_12.png', width=700)


chars_sorted = sorted(char_set)
char2int = {ch:i for i,ch in enumerate(chars_sorted)}
char_array = np.array(chars_sorted)

text_encoded = np.array(
    [char2int[ch] for ch in text],
    dtype=np.int32)

print('Text encoded shape: ', text_encoded.shape)

print(text[:15], '     == Encoding ==> ', text_encoded[:15])
print(text_encoded[15:21], ' == Reverse  ==> ', ''.join(char_array[text_encoded[15:21]]))
Text encoded shape:  (1112350,)
THE MYSTERIOUS       == Encoding ==>  [44 32 29  1 37 48 43 44 29 42 33 39 45 43  1]
[33 43 36 25 38 28]  == Reverse  ==>  ISLAND

The text generation task can be framed as a classification problem. The left part of the following figure can be viewed as the input; to generate new text, our goal is to design a model that predicts the following character given an input sequence, where the input sequence is an incomplete piece of text.

Given that we have 80 unique characters, this is a multiclass classification task.

Image(filename='images/16_13.png', width=700)


import tensorflow as tf


ds_text_encoded = tf.data.Dataset.from_tensor_slices(text_encoded)

for ex in ds_text_encoded.take(5):
    print('{} -> {}'.format(ex.numpy(), char_array[ex.numpy()]))
44 -> T
32 -> H
29 -> E
1 ->  
37 -> M
seq_length = 40
chunk_size = seq_length + 1

ds_chunks = ds_text_encoded.batch(chunk_size, drop_remainder=True)

## inspection:
for seq in ds_chunks.take(1):
    input_seq = seq[:seq_length].numpy()
    target = seq[seq_length].numpy()
    print(input_seq, ' -> ', target)
    print(repr(''.join(char_array[input_seq])), 
          ' -> ', repr(''.join(char_array[target])))
[44 32 29  1 37 48 43 44 29 42 33 39 45 43  1 33 43 36 25 38 28  1  6  6
  6  0  0  0  0  0 40 67 64 53 70 52 54 53  1 51]  ->  74
'THE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced b'  ->  'y'

Starting from a sequence of length 1, that is, a single character, we can iteratively generate new text based on this multiclass classification approach, as shown in the figure below:

Image(filename='images/16_14.png', width=700)


To implement the text generation task in TensorFlow, let us first clip the sequence length to 40. This means that the input tensor $\boldsymbol{x}$ consists of 40 tokens.

In practice, the sequence length affects the quality of the generated text: longer sequences can result in more meaningful sentences, whereas with shorter sequences the model may focus on capturing individual words correctly while ignoring most of the context.

Although longer sequences usually produce more meaningful sentences, as mentioned earlier, RNN models have trouble capturing long-range dependencies in long sequences.

Hence, in practice, finding the sweet spot for the sequence length is a hyperparameter optimization problem that has to be evaluated empirically. Here, we choose 40, as it offers a good trade-off.

As the preceding figure shows, the input $\boldsymbol{x}$ and the target $\boldsymbol{y}$ are offset by one character. We will therefore split the text into chunks of size 41: the first 40 characters form the input sequence $\boldsymbol{x}$, and the last 40 elements form the target sequence $\boldsymbol{y}$.

## define the function for splitting x & y
def split_input_target(chunk):
    input_seq = chunk[:-1]   # all elements except the last one
    target_seq = chunk[1:]   # all elements except the first one
    return input_seq, target_seq

ds_sequences = ds_chunks.map(split_input_target)

## inspection:
for example in ds_sequences.take(2):
    print(' Input (x):', repr(''.join(char_array[example[0].numpy()])))
    print('Target (y):', repr(''.join(char_array[example[1].numpy()])))
    print()
 Input (x): 'THE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced b'
Target (y): 'HE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced by'

 Input (x): ' Anthony Matonak, and Trevor Carlson\n\n\n\n'
Target (y): 'Anthony Matonak, and Trevor Carlson\n\n\n\n\n'

To divide the dataset into mini-batches, we first shuffle the examples and then group the input into batches.

# Batch size
BATCH_SIZE = 64
BUFFER_SIZE = 10000

tf.random.set_seed(1)
ds = ds_sequences.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)# drop_remainder=True)

ds
<BatchDataset shapes: ((None, 40), (None, 40)), types: (tf.int32, tf.int32)>

6.5.2 Building the character-level RNN model

def build_model(vocab_size, embedding_dim, rnn_units):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        tf.keras.layers.LSTM(
            rnn_units, return_sequences=True),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model


charset_size = len(char_array)
embedding_dim = 256
rnn_units = 512

tf.random.set_seed(1)

model = build_model(
    vocab_size = charset_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 256)         20480     
_________________________________________________________________
lstm_1 (LSTM)                (None, None, 512)         1574912   
_________________________________________________________________
dense_1 (Dense)              (None, None, 80)          41040     
=================================================================
Total params: 1,636,432
Trainable params: 1,636,432
Non-trainable params: 0
_________________________________________________________________

The LSTM layer reports an output shape of (None, None, 512), which means its output is a rank-3 tensor:

the first dimension is the batch size, the second is the length of the output sequence, and the last is the number of hidden units.
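To make these shapes concrete, here is a minimal sanity check (not part of the original notebook; the dummy batch below is made up for illustration):

# Hypothetical sanity check: a dummy batch of 64 integer-encoded sequences
# of length 40. The embedding output is (64, 40, 256), the LSTM output is
# (64, 40, 512), and the full model returns logits of shape (64, 40, 80).
dummy_batch = tf.zeros((64, 40), dtype=tf.int32)
embedded = model.layers[0](dummy_batch)
print(embedded.shape)                      # (64, 40, 256)
print(model.layers[1](embedded).shape)     # (64, 40, 512)
print(model(dummy_batch).shape)            # (64, 40, 80)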

model.compile(
    optimizer='adam', 
    loss=tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True
    ))

model.fit(ds, epochs=20)
Train for 424 steps
Epoch 1/20
424/424 [==============================] - 15s 35ms/step - loss: 2.3011
Epoch 2/20
424/424 [==============================] - 9s 22ms/step - loss: 1.7329
Epoch 3/20
424/424 [==============================] - 9s 22ms/step - loss: 1.5340
Epoch 4/20
424/424 [==============================] - 9s 22ms/step - loss: 1.4204
Epoch 5/20
424/424 [==============================] - 9s 22ms/step - loss: 1.3476
Epoch 6/20
424/424 [==============================] - 9s 22ms/step - loss: 1.2975
Epoch 7/20
424/424 [==============================] - 9s 22ms/step - loss: 1.2596
Epoch 8/20
424/424 [==============================] - 10s 23ms/step - loss: 1.2286
Epoch 9/20
424/424 [==============================] - 10s 23ms/step - loss: 1.2030
Epoch 10/20
424/424 [==============================] - 9s 22ms/step - loss: 1.1820
Epoch 11/20
424/424 [==============================] - 9s 22ms/step - loss: 1.1620
Epoch 12/20
424/424 [==============================] - 9s 22ms/step - loss: 1.1446
Epoch 13/20
424/424 [==============================] - 9s 22ms/step - loss: 1.1282
Epoch 14/20
424/424 [==============================] - 9s 22ms/step - loss: 1.1127
Epoch 15/20
424/424 [==============================] - 9s 22ms/step - loss: 1.0985
Epoch 16/20
424/424 [==============================] - 9s 22ms/step - loss: 1.0846
Epoch 17/20
424/424 [==============================] - 9s 22ms/step - loss: 1.0717
Epoch 18/20
424/424 [==============================] - 9s 22ms/step - loss: 1.0589
Epoch 19/20
424/424 [==============================] - 10s 22ms/step - loss: 1.0461
Epoch 20/20
424/424 [==============================] - 9s 22ms/step - loss: 1.0338





<tensorflow.python.keras.callbacks.History at 0x22b12ed4748>

6.5.3 Evaluation phase: generating new text

The model trained above returns 80 logits for each position, one for each unique character; these can be converted into probabilities via the softmax function. To predict the next character in the sequence, we could simply select the character with the highest logit value, which is equivalent to selecting the character with the highest probability.

However, we do not want to always pick the character with the highest probability, because that would make the model generate exactly the same sequence every time. Instead, we can randomly sample from the outputs using tf.random.categorical():
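As a small illustrative check (the logit values below are made up), selecting the largest logit and selecting the largest softmax probability pick the same character, because softmax is monotonically increasing:

# Made-up logits: argmax over the logits and argmax over the softmax
# probabilities select the same index.
example_logits = tf.constant([[1.0, 3.0, 2.0]])
print(tf.argmax(example_logits, axis=1).numpy())                   # [1]
print(tf.argmax(tf.math.softmax(example_logits), axis=1).numpy())  # [1]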

tf.random.set_seed(1)

logits = [[1.0, 1.0, 1.0]]
print('Probabilities:', tf.math.softmax(logits).numpy()[0])

samples = tf.random.categorical(
    logits=logits, num_samples=10)
tf.print(samples.numpy())
Probabilities: [0.33333334 0.33333334 0.33333334]
array([[0, 0, 1, 2, 0, 0, 0, 0, 1, 0]])
tf.random.set_seed(1)

logits = [[1.0, 1.0, 3.0]]
print('Probabilities:', tf.math.softmax(logits).numpy()[0])

samples = tf.random.categorical(
    logits=logits, num_samples=10)
tf.print(samples.numpy())
Probabilities: [0.10650698 0.10650698 0.78698605]
array([[2, 2, 0, 2, 2, 2, 2, 2, 1, 2]], dtype=int64)
def sample(model, starting_str, 
           len_generated_text=500, 
           max_input_length=40,
           scale_factor=1.0):
    # encode the starting string into integer indices, shape (1, len(starting_str))
    encoded_input = [char2int[s] for s in starting_str]
    encoded_input = tf.reshape(encoded_input, (1, -1))

    generated_str = starting_str

    model.reset_states()
    for i in range(len_generated_text):
        # logits for every position of the current input, shape (seq_len, vocab_size)
        logits = model(encoded_input)
        logits = tf.squeeze(logits, 0)

        # scale the logits before sampling to control randomness
        scaled_logits = logits * scale_factor
        new_char_indx = tf.random.categorical(
            scaled_logits, num_samples=1)
        
        # keep only the character sampled for the last position
        new_char_indx = tf.squeeze(new_char_indx)[-1].numpy()    

        generated_str += str(char_array[new_char_indx])
        
        # append the new character and keep only the last
        # max_input_length characters as input for the next step
        new_char_indx = tf.expand_dims([new_char_indx], 0)
        encoded_input = tf.concat(
            [encoded_input, new_char_indx],
            axis=1)
        encoded_input = encoded_input[:, -max_input_length:]

    return generated_str

tf.random.set_seed(1)
print(sample(model, starting_str='The island'))
The island was extloted out. Not a
sign landing loudless, formed the
rocks,
at the nettom of the rock possible studing works above his hollow mistaken. It was evident that neither sixal valutions of carbone
coast, listened.

About Pencroft.

“I have me shared the experiment, Harding admitted to conceal him. The excettlement waslingly inscothed of his heads and gass,
succeeded from her chest, and having bears one of their
guard over.
They examining versed. The re-fised with the corral.

For they had found 
  • Predictability vs. randomness
logits = np.array([[1.0, 1.0, 3.0]])

print('Probabilities before scaling:        ', tf.math.softmax(logits).numpy()[0])

print('Probabilities after scaling with 0.5:', tf.math.softmax(0.5*logits).numpy()[0])

print('Probabilities after scaling with 0.1:', tf.math.softmax(0.1*logits).numpy()[0])
Probabilities before scaling:         [0.10650698 0.10650698 0.78698604]
Probabilities after scaling with 0.5: [0.21194156 0.21194156 0.57611688]
Probabilities after scaling with 0.1: [0.31042377 0.31042377 0.37915245]
tf.random.set_seed(1)
print(sample(model, starting_str='The island', 
             scale_factor=2.0))
The island was in the lad tide, and the last surprised the colonists were corrected the convicts would have been supposed that the hundred miles from the storm than the settlers resulated the colonists of the corral, where he was the
colonists were delighted to continue the present to the northern part of the brig it is really likely to contract the shore. The colonists felt it was such a shot. I don’t believe that the powder and the animals were overcome the rocks. The boat was extinguished and shot with
tf.random.set_seed(1)
print(sample(model, starting_str='The island', 
             scale_factor=0.5))
The island had egbloty hollideken,
Heall? ” would “make cif-” retrane,
myrefrejy. Madulities, walded ago from thes open. he Evidepth Papolor Jup-Docks. Itsig heart;
And Hern Jup rasilarly metharyioge’!

“We can, at Fawl,
rassed
barking,-- “:hocks,” del as Lonjeckies nglt good’s bow’s,! Therefore, nicedful powdenel! they relievel Enutencribt,
frosed us?s, smeting, whnee
were air
of her mature mill dwemby agacas:-illy ourselmen,” extraiged ever. MTanvey. Lalvers were well
bry Joac. We arl, Harding quite upi

6.6 Language understanding with the Transformer model

The Transformer architecture is based on an attention mechanism, or more precisely, the self-attention mechanism. Using attention means that the model can focus more on the parts of the input sequence that are most relevant to the target.

6.6.1 Understanding the self-attention mechanism

Let us start with a basic version of self-attention.

Suppose we have an input sequence of length $T$, $\boldsymbol{x}^{(0)}, \boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(T)}$, and a corresponding output sequence $\boldsymbol{o}^{(0)}, \boldsymbol{o}^{(1)}, \ldots, \boldsymbol{o}^{(T)}$.

Both $\boldsymbol{x}^{(t)}$ and $\boldsymbol{o}^{(t)}$ are vectors of size $d$, that is, $\boldsymbol{x}^{(t)} \in \mathbb{R}^{d}$.

For a seq2seq task, the goal of self-attention is to model the dependencies of each element in the output sequence on the elements of the input sequence.
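Concretely, in the basic (parameter-free) form of self-attention, each output is a weighted sum of all input vectors; the standard computation that the figures below illustrate can be summarized as:

$$
\omega_{ij} = \boldsymbol{x}^{(i)\top}\boldsymbol{x}^{(j)}, \qquad
\alpha_{ij} = \frac{\exp(\omega_{ij})}{\sum_{j'=0}^{T}\exp(\omega_{ij'})}, \qquad
\boldsymbol{o}^{(i)} = \sum_{j=0}^{T}\alpha_{ij}\,\boldsymbol{x}^{(j)}
$$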

In practice, the computation is illustrated in the figure below:

Image(filename='images/16_15.png', width=700)

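As a minimal code sketch of this basic form (a toy example with made-up values, not taken from the original post):

# Basic (parameter-free) self-attention on a toy sequence of T=3 vectors
# with d=4 features; the values are made up for illustration only.
x_toy = tf.random.uniform((3, 4), seed=1)          # input sequence, shape (T, d)
omega = tf.matmul(x_toy, x_toy, transpose_b=True)  # pairwise dot products, shape (T, T)
alpha = tf.math.softmax(omega, axis=1)             # attention weights, rows sum to 1
o_toy = tf.matmul(alpha, x_toy)                    # each output is a weighted sum of inputs
print(o_toy.shape)                                 # (3, 4)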

# A more formal, step-by-step illustration, following Hung-yi Lee's (李宏毅) deep learning lectures:
Image(filename='images/01.jpg', width=700)


Image(filename='images/02.jpg', width=700)


Image(filename='images/03.jpg', width=700)


Image(filename='images/04.jpg', width=700)


Image(filename='images/05.jpg', width=700)


Image(filename='images/06.jpg', width=700)


Image(filename='images/07.jpg', width=700)


Image(filename='images/08.jpg', width=700)


Image(filename='images/09.jpg', width=700)


Image(filename='images/10.jpg', width=700)


Image(filename='images/11.jpg', width=700)


6.6.2 Multi-head attention

Image(filename='images/16_16.png', width=700)


Image(filename='images/12.jpg', width=700)


Image(filename='images/13.jpg', width=700)


Image(filename='images/14.jpg', width=700)


Image(filename='images/15.jpg', width=700)

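The figures above introduce learnable query, key, and value projections and multiple attention heads. As a rough sketch of the scaled dot-product attention they depict (a single head; the sizes and variable names below are made up for illustration and are not taken from the original post):

# Scaled dot-product attention with learned projections (one head).
# d_model, d_k, T, and the random weights are illustrative only.
d_model, d_k, T = 16, 8, 5

tf.random.set_seed(1)
X = tf.random.uniform((T, d_model))          # toy input sequence

W_q = tf.random.uniform((d_model, d_k))      # query/key/value projection matrices
W_k = tf.random.uniform((d_model, d_k))      # (randomly initialized here; learned in practice)
W_v = tf.random.uniform((d_model, d_k))

Q = tf.matmul(X, W_q)
K = tf.matmul(X, W_k)
V = tf.matmul(X, W_v)

scores = tf.matmul(Q, K, transpose_b=True) / tf.math.sqrt(float(d_k))  # (T, T)
weights = tf.math.softmax(scores, axis=-1)
head_output = tf.matmul(weights, V)          # (T, d_k)

# Multi-head attention repeats this with h independent sets of W_q/W_k/W_v
# and concatenates the h head outputs along the feature dimension.
print(head_output.shape)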
