A Detailed Walkthrough of the Transformer Machine Translation Code

WitsMakeMen

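Reposted from: https://blog.csdn.net/mijiaoxiaosan/article/details/74909076
About a month ago Google published the paper "Attention Is All You Need", which proposes a new architecture called the Transformer for machine translation. It drops the conventional reliance on CNNs or RNNs and still achieves very good results, which has sparked wide discussion in both industry and academia. Another post of mine, "Understanding Attention Is All You Need", analyzes the paper itself. Shortly after the paper came out, someone implemented the Transformer in TensorFlow: "A TensorFlow Implementation of the Transformer: Attention Is All You Need". In this post I go through the code of that open-source implementation in detail.
The implementation differs from the paper in a few respects: for convenience it uses the IWSLT 2016 German-English translation dataset, it uses a plain positional embedding, and it sets the learning rate to a small value from the start. These are minor differences, and the core model is the same.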

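The implementation consists of the following files:

hyperparams.py: contains all the hyperparameters.
prepro.py: generates the vocabulary files for the source and target languages.
data_load.py: contains the functions for loading and batching data.
modules.py: implements the building blocks of the encoder and decoder networks.
train.py: training code; defines the model, the loss function, and the training and checkpointing procedure.
eval.py: evaluates the model.
The following sections go through each file in turn.
First, hyperparams.py.
All hyperparameters used by the implementation live in this file. Here is the complete code: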

class Hyperparams:
    '''Hyperparameters'''
    # data
    source_train = 'corpora/train.tags.de-en.de'
    target_train = 'corpora/train.tags.de-en.en'
    source_test = 'corpora/IWSLT16.TED.tst2014.de-en.de.xml'
    target_test = 'corpora/IWSLT16.TED.tst2014.de-en.en.xml'

    # training
    batch_size = 32 # alias = N
    lr = 0.0001 # learning rate. In paper, learning rate is adjusted to the global step.
    logdir = 'logdir' # log directory

    # model
    maxlen = 10 # Maximum number of words in a sentence. alias = T.
                # Feel free to increase this if you are ambitious.
    min_cnt = 20 # words that occur fewer than min_cnt times are encoded as <UNK>.
    hidden_units = 512 # alias = C
    num_blocks = 6 # number of encoder/decoder blocks
    num_epochs = 20
    num_heads = 8
    dropout_rate = 0.1

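Nothing here is hard to follow: the file just defines the hyperparameters used later. First come the paths of the training and test data for the source and target languages. Next come the batch size, the initial learning rate, and the log directory; batch_size is what the later code calls N, and it shows up often in shape comments. Last come the model parameters. maxlen is the maximum number of words in a sentence, set to 10 and referred to as T elsewhere in the code; you can increase it if you like. min_cnt is set to 20, meaning every word that occurs fewer than min_cnt times is treated as <UNK>. hidden_units is 512, the number of hidden units, referred to as C in the code. num_blocks and num_heads follow the settings in the paper, the number of epochs is 20, and dropout_rate needs no explanation.
That covers the hyperparameters of this implementation.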

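Next is the preprocessing code, prepro.py, which generates the vocabulary files for the source and target languages.
To make this concrete, let us first look at what a generated vocabulary file looks like. Here are the first few lines of the German vocabulary file: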

<PAD> 1000000000
<UNK> 1000000000
<S> 1000000000
</S> 1000000000
die 85235
und 77082
der 56248
ist 51457
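As you can see, the file records every word that occurs in the training data together with its number of occurrences. The first column is the word and the second is its count. Four special tokens are also defined and given a very large count so that they sit at the top of the file.
As before, the code first.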

from __future__ import print_function
from hyperparams import Hyperparams as hp
import tensorflow as tf
import numpy as np
import codecs
import os
import regex
from collections import Counter

def make_vocab(fpath, fname):
    '''Constructs vocabulary.

    Args:
      fpath: A string. Input file path.
      fname: A string. Output file name.

    Writes vocabulary line by line to `preprocessed/fname`
    '''
    text = codecs.open(fpath, 'r', 'utf-8').read()
    # Keep only whitespace, Latin letters and apostrophes
    text = regex.sub(r"[^\s\p{Latin}']", "", text)
    words = text.split()
    word2cnt = Counter(words)
    if not os.path.exists('preprocessed'): os.mkdir('preprocessed')
    with codecs.open('preprocessed/{}'.format(fname), 'w', 'utf-8') as fout:
        fout.write("{}\t1000000000\n{}\t1000000000\n{}\t1000000000\n{}\t1000000000\n".format("<PAD>", "<UNK>", "<S>", "</S>"))
        for word, cnt in word2cnt.most_common(len(word2cnt)):
            fout.write(u"{}\t{}\n".format(word, cnt))

if __name__ == '__main__':
    make_vocab(hp.source_train, "de.vocab.tsv")
    make_vocab(hp.target_train, "en.vocab.tsv")
    print("Done")
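make_vocab is the function that generates a vocabulary file. It takes two arguments: fpath is the path of the input file (the training data), and fname is the name of the vocabulary file to write.
The function writes the vocabulary line by line to 'preprocessed/fname'.
Notice that the file is opened and read with codecs.open. How does that differ from the built-in open? Language processing is almost always done on unicode, and codecs.open decodes the file into unicode as it is read; its third argument is the encoding of the source file. For details on codecs, see the post on the Python codecs module ("python模块之codecs").
After reading the file, the text is processed with a regular expression. sub replaces matches in a string and takes three arguments: every part of the string given as the third argument that matches the pattern in the first argument is replaced with the second argument. Here the pattern matches every character that is not whitespace, a Latin letter, or an apostrophe, and the replacement is the empty string, so those characters are deleted.
Next the text is split on whitespace into words, which go into a Counter. The result behaves like a dictionary whose keys are words and whose values are counts. The directory for the preprocessed files is then created. codecs.open is used again to create the output file; the four special tokens are written as its first four lines, and then most_common is used to write every word in the training set with its count, one per line, in descending order of frequency.
Running the function once on the German file and once on the English file produces the two vocabulary files.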
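The counting and writing steps can be tried without any files. The following is a minimal sketch rather than the author's code: build_vocab_lines is a hypothetical helper that takes a string instead of a file path and returns the lines instead of writing them, it skips the regex cleanup, and the German sample sentence is made up.

```python
from collections import Counter

SPECIAL_TOKENS = ("<PAD>", "<UNK>", "<S>", "</S>")

def build_vocab_lines(text):
    """Return vocabulary lines ("word<TAB>count") for a whitespace-separated text."""
    word2cnt = Counter(text.split())
    # Special tokens come first with a huge dummy count.
    lines = ["{}\t1000000000".format(tok) for tok in SPECIAL_TOKENS]
    # most_common() yields (word, count) pairs in descending order of count.
    lines += ["{}\t{}".format(word, cnt) for word, cnt in word2cnt.most_common()]
    return lines

lines = build_vocab_lines("die Katze und die Maus und die Welt")
print("\n".join(lines))
```

The fifth line of the output is the most frequent word with its count, here "die" with 3.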

Next is the third file, data_load.py, which contains all the functions for loading and batching data. Again, the code first.

from __future__ import print_function
from hyperparams import Hyperparams as hp
import tensorflow as tf
import numpy as np
import codecs
import regex

def load_de_vocab():
    # Keep only words that occur at least hp.min_cnt times
    vocab = [line.split()[0] for line in codecs.open('preprocessed/de.vocab.tsv', 'r', 'utf-8').read().splitlines() if int(line.split()[1])>=hp.min_cnt]
    word2idx = {word: idx for idx, word in enumerate(vocab)}
    idx2word = {idx: word for idx, word in enumerate(vocab)}
    return word2idx, idx2word
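I will go through this file piece by piece.
First, load_de_vocab(). Its purpose is to assign an id to every German word and return two dictionaries, one mapping words to ids and one mapping ids to words. The function uses codecs.open to read the vocabulary file generated during preprocessing. Note that while reading each line it drops words that occur fewer than hp.min_cnt times (20 with the settings above). The result is a list of words, and enumerating it with enumerate(vocab) produces the two dictionaries. Because the four special tokens are the first lines of the file, <PAD>, <UNK>, <S> and </S> get the ids 0, 1, 2 and 3.
Next is the code for load_en_vocab():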

def load_en_vocab():
    # Keep only words that occur at least hp.min_cnt times
    vocab = [line.split()[0] for line in codecs.open('preprocessed/en.vocab.tsv', 'r', 'utf-8').read().splitlines() if int(line.split()[1])>=hp.min_cnt]
    word2idx = {word: idx for idx, word in enumerate(vocab)}
    idx2word = {idx: word for idx, word in enumerate(vocab)}
    return word2idx, idx2word
This function is identical to the German one except that it builds the English word/id dictionaries, so there is nothing more to add.
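To see the min_cnt filter and the resulting ids in isolation, here is a small sketch. build_index is a hypothetical helper that works on a list of lines rather than a file, and the "selten" line is an invented rare word; the other lines follow the vocabulary sample shown earlier.

```python
def build_index(vocab_lines, min_cnt):
    """Map words to ids and back, keeping only words seen at least min_cnt times."""
    vocab = [line.split()[0] for line in vocab_lines
             if int(line.split()[1]) >= min_cnt]
    word2idx = {word: idx for idx, word in enumerate(vocab)}
    idx2word = {idx: word for idx, word in enumerate(vocab)}
    return word2idx, idx2word

vocab_lines = [
    "<PAD>\t1000000000", "<UNK>\t1000000000", "<S>\t1000000000", "</S>\t1000000000",
    "die\t85235", "und\t77082",
    "selten\t3",  # invented rare word, below min_cnt
]
word2idx, idx2word = build_index(vocab_lines, min_cnt=20)
print(word2idx["<PAD>"], word2idx["die"], "selten" in word2idx)  # 0 4 False
print(word2idx.get("selten", 1))  # 1, i.e. the id of <UNK>
```

The last line mirrors how the data-loading code looks words up: anything filtered out of the vocabulary falls back to id 1.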

Next is the create_data function.

def create_data(source_sents, target_sents):
    de2idx, idx2de = load_de_vocab()
    en2idx, idx2en = load_en_vocab()

    # Index
    x_list, y_list, Sources, Targets = [], [], [], []
    for source_sent, target_sent in zip(source_sents, target_sents):
        x = [de2idx.get(word, 1) for word in (source_sent + u" </S>").split()] # 1: OOV, </S>: End of Text
        y = [en2idx.get(word, 1) for word in (target_sent + u" </S>").split()]
        if max(len(x), len(y)) <=hp.maxlen:
            x_list.append(np.array(x))
            y_list.append(np.array(y))
            Sources.append(source_sent)
            Targets.append(target_sent)

    # Pad (np.pad is the public name of the np.lib.pad used in the original code)
    X = np.zeros([len(x_list), hp.maxlen], np.int32)
    Y = np.zeros([len(y_list), hp.maxlen], np.int32)
    for i, (x, y) in enumerate(zip(x_list, y_list)):
        X[i] = np.pad(x, [0, hp.maxlen-len(x)], 'constant', constant_values=(0, 0))
        Y[i] = np.pad(y, [0, hp.maxlen-len(y)], 'constant', constant_values=(0, 0))

    return X, Y, Sources, Targets

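The function takes two arguments, source_sents and target_sents, which are the lists of source-language and target-language sentences. Each element of a list is one sentence.
It first calls the two functions above to build the word/id dictionaries for both languages.
It then iterates over the two sentence lists in parallel, one sentence pair at a time. An end-of-text token </S> is appended to each sentence to mark its end. Each word of the resulting sentence is looked up in the corresponding word/id dictionary and its id is appended to a new list; a word that is not in the dictionary gets id 1 (the id of <UNK>). This yields two lists of ids, one per language. The code then checks that neither list is longer than the maximum sentence length hp.maxlen. If so, the two id lists are appended to x_list and y_list, which the model will consume, and the original sentences (as words) are appended to Sources and Targets.
The second half of the function does the padding. For background on padding in numpy, see the post on numpy's prod and pad operations ("numpy–prod和pad运算"). Since x and y are one-dimensional, they can only be padded at the front and at the back, so the second argument of pad is a two-element list. The first element is 0, meaning nothing is padded in front of x or y. The second element is hp.maxlen-len(x) for x and hp.maxlen-len(y) for y, meaning the back is padded by the difference between the maximum sentence length and the current length. The padding value is given by constant_values: the padded id is 0, which is also the id of <PAD> in the vocabulary. After padding, all id sequences have the same length.
The function returns the equal-length id arrays X and Y together with the original sentence lists Sources and Targets. X and Y both have shape [len(x_list), hp.maxlen], where len(x_list) is the number of sentences kept and hp.maxlen is the maximum sentence length.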
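The padding step on its own looks like this. It is a sketch under assumptions: pad_ids is a hypothetical helper, the ids are made up, and it uses np.pad, the public name of the padding function.

```python
import numpy as np

def pad_ids(id_lists, maxlen, pad_id=0):
    """Right-pad each id list with pad_id up to maxlen and stack into one array."""
    out = np.zeros([len(id_lists), maxlen], np.int32)
    for i, ids in enumerate(id_lists):
        # [0, maxlen - len(ids)]: pad nothing in front, the shortfall at the back.
        out[i] = np.pad(ids, [0, maxlen - len(ids)], 'constant',
                        constant_values=(pad_id, pad_id))
    return out

X = pad_ids([[4, 5, 3], [6, 3]], maxlen=5)
print(X.shape)     # (2, 5)
print(X.tolist())  # [[4, 5, 3, 0, 0], [6, 3, 0, 0, 0]]
```

Every row ends up with exactly maxlen entries, which is what lets the sentences be stacked into a single [number of sentences, maxlen] array.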

Next is load_train_data(). The code:

def load_train_data():
    # Skip empty lines and metadata lines (those starting with "<"), then clean each sentence
    de_sents = [regex.sub(r"[^\s\p{Latin}']", "", line) for line in codecs.open(hp.source_train, 'r', 'utf-8').read().split("\n") if line and line[0] != "<"]
    en_sents = [regex.sub(r"[^\s\p{Latin}']", "", line) for line in codecs.open(hp.target_train, 'r', 'utf-8').read().split("\n") if line and line[0] != "<"]

    X, Y, Sources, Targets = create_data(de_sents, en_sents)
    return X, Y

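As the name suggests, this function loads the training data, and it does so simply by returning the equal-length id arrays produced by create_data. All load_train_data adds is the two arguments de_sents and en_sents.
These two sentence lists are again built by reading the training data with codecs.open. The text is split into lines on the newline character \n, and only the non-empty lines that do not start with '<' are kept (lines starting with '<' are metadata, not actual data). Each remaining line is cleaned with the same regular expression as before.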
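The line filter is easy to check on a small string. In this sketch keep_sentence_lines is a hypothetical helper and the sample text is invented; the regex cleanup is left out.

```python
def keep_sentence_lines(raw_text):
    """Split on newlines and drop empty lines and lines starting with '<'."""
    return [line for line in raw_text.split("\n") if line and line[0] != "<"]

# Invented sample: two metadata lines, one empty line, two sentences.
raw = "<url>http://example.com</url>\nGuten Morgen\n\n<title>Beispiel</title>\nVielen Dank"
sentences = keep_sentence_lines(raw)
print(sentences)  # ['Guten Morgen', 'Vielen Dank']
```

The `if line` test comes first so that `line[0]` is never evaluated on an empty string.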

Next is load_test_data().

def load_test_data():
    def _refine(line):
        line = regex.sub("<[^>]+>", "", line)  # strip XML tags
        line = regex.sub(r"[^\s\p{Latin}']", "", line)
        return line.strip()

    de_sents = [_refine(line) for line in codecs.open(hp.source_test, 'r', 'utf-8').read().split("\n") if line and line[:4] == "<seg"]
    en_sents = [_refine(line) for line in codecs.open(hp.target_test, 'r', 'utf-8').read().split("\n") if line and line[:4] == "<seg"]

    X, Y, Sources, Targets = create_data(de_sents, en_sents)
    return X, Sources, Targets # (1064, 150)

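load_test_data is much like load_train_data. It produces the fixed-length id array for the source language of the test data (the target language is predicted by the model, so no ids are needed for it), together with the original source and target sentence lists.
The difference is that the regular-expression handling is slightly different. It also uses strip(), which with default arguments removes leading and trailing whitespace from a string. In addition, each line of the data file starts with "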

        # Dropouts
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))

        # Weighted sum
        outputs = tf.matmul(outputs, V_) # ( h*N, T_q, C/h)

        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2 ) # (N, T_q, C)

        # Residual connection
        outputs += queries

        # Normalize
        outputs = normalize(outputs) # (N, T_q, C)

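First, dropout is applied to outputs, the attention weights obtained after the various masks. These weights are then used to take a weighted sum of V_, which gives the result of the multi-head attention; the code does this directly with the matrix product matmul. outputs has shape (h*N, T_q, T_k) and V_ has shape (h*N, T_k, C/h), so the weighted sum has shape (h*N, T_q, C/h).
The results of the individual heads are stacked along the first dimension, so they are split apart and concatenated along the last dimension to form the final outputs, with shape (N, T_q, C).
The original queries are then added to outputs, which is a residual connection, and the result is passed through the normalize function defined earlier.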
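The shape bookkeeping is the part that is easiest to get wrong, so here is a NumPy sketch of the weighted sum and the split/concat step. It is an illustration with random toy data, not the TensorFlow code: the sizes, the random weights, and the variable names weights and restored are all assumptions.

```python
import numpy as np

N, T_q, T_k, C, h = 2, 3, 4, 8, 2   # toy sizes; C must be divisible by h

rng = np.random.default_rng(0)
weights = rng.random((h * N, T_q, T_k))         # attention weights, heads stacked on axis 0
weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1, like a softmax output
V_ = rng.random((h * N, T_k, C // h))           # values, heads stacked on axis 0

# Weighted sum: batched matrix product over the first axis.
outputs = np.matmul(weights, V_)                # (h*N, T_q, C/h)

# Restore shape: split the stacked heads apart, concatenate on the channel axis.
restored = np.concatenate(np.split(outputs, h, axis=0), axis=2)  # (N, T_q, C)

print(outputs.shape, restored.shape)  # (4, 3, 4) (2, 3, 8)
```

After the concat, the first C/h channels of each position come from head 0 and the next C/h from head 1, so every position again has a C-dimensional vector that can be added to queries.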

    return outputs

This returns the final outputs.

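That completes the analysis of multihead_attention. This function is the core idea of the paper and also the core of this open-source code.

The paper also feeds the output into a fully connected feed-forward network. Here is that part of the code.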

def feedforward(inputs,
                num_units=[2048, 512],
                scope="multihead_attention",
                reuse=None):
    '''Point-wise feed forward net.

    Args:
      inputs: A 3d tensor with shape of [N, T, C].
      num_units: A list of two integers.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.

    Returns:
      A 3d tensor with the same shape and dtype as inputs
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Inner layer
        params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1,
                  "activation": tf.nn.relu, "use_bias": True}
        outputs = tf.layers.conv1d(**params)

        # Readout layer
        params = {"inputs": outputs, "filters": num_units[1], "kernel_size": 1,
                  "activation": None, "use_bias": True}
        outputs = tf.layers.conv1d(**params)

        # Residual connection
        outputs += inputs

        # Normalize
        outputs = normalize(outputs)

    return outputs

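The input is a tensor of shape [N, T, C], and num_units gives the number of hidden units of the two layers.
The network is built from one-dimensional convolutions. I was puzzled when I first saw the code, but it does work: with kernel_size 1, each convolution applies the same linear map to every position independently. A ReLU nonlinearity sits between the two convolutions. After them comes the residual connection, which adds inputs, and then normalize. I do wonder why the author did not just use layers.dense for the fully connected layers.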
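A convolution with kernel_size 1 never mixes neighbouring positions, so it computes the same thing as a dense layer applied at every position. The NumPy sketch below illustrates that position-wise behaviour with random toy weights; it is not the TensorFlow code, it leaves out the residual connection and normalization, and all names and sizes in it are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, C, hidden = 2, 3, 4, 16            # toy sizes
x = rng.standard_normal((N, T, C))
W1, b1 = rng.standard_normal((C, hidden)), np.zeros(hidden)
W2, b2 = rng.standard_normal((hidden, C)), np.zeros(C)

def ffn(x):
    """Position-wise feed-forward: ReLU(x W1 + b1) W2 + b2 on the last axis."""
    inner = np.maximum(0.0, x @ W1 + b1)  # (..., hidden)
    return inner @ W2 + b2                # (..., C)

batched = ffn(x)                          # whole [N, T, C] tensor at once
# Same computation, one position at a time.
per_position = np.array([[ffn(x[n, t]) for t in range(T)] for n in range(N)])

print(batched.shape)                        # (2, 3, 4)
print(np.allclose(batched, per_position))   # True
```

Since both routes give the same numbers, the choice between a kernel-size-1 convolution and a dense layer here looks like a matter of API preference rather than of model behaviour.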

Finally, label smoothing is applied to the labels:

def label_smoothing(inputs, epsilon=0.1):
    '''Applies label smoothing. See https://arxiv.org/abs/1512.00567.

    Args:
      inputs: A 3d tensor with shape of [N, T, V], where V is the number of vocabulary.
      epsilon: Smoothing rate.

    For example,

    ```
    import tensorflow as tf
    inputs = tf.convert_to_tensor([[[0, 0, 1],
       [0, 1, 0],
       [1, 0, 0]],
      [[1, 0, 0],
       [1, 0, 0],
       [0, 1, 0]]], tf.float32)

    outputs = label_smoothing(inputs)

    with tf.Session() as sess:
        print(sess.run([outputs]))

    >>
    [array([[[ 0.03333334,  0.03333334,  0.93333334],
        [ 0.03333334,  0.93333334,  0.03333334],
        [ 0.93333334,  0.03333334,  0.03333334]],
       [[ 0.93333334,  0.03333334,  0.03333334],
        [ 0.93333334,  0.03333334,  0.03333334],
        [ 0.03333334,  0.93333334,  0.03333334]]], dtype=float32)]
    ```
    '''
    K = inputs.get_shape().as_list()[-1] # number of channels
    return ((1-epsilon) * inputs) + (epsilon / K)

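The docstring is detailed enough that little needs adding. As it shows, each 0 in the one-hot vector becomes a small positive number and each 1 becomes a number slightly below 1.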
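The arithmetic behind the docstring's numbers can be reproduced for a single row in plain Python. This is only a worked check of the formula (1 - epsilon) * y + epsilon / K; the helper name smooth is hypothetical.

```python
def smooth(one_hot_row, epsilon=0.1):
    """Apply (1 - epsilon) * y + epsilon / K to one one-hot row of length K."""
    K = len(one_hot_row)
    return [(1 - epsilon) * y + epsilon / K for y in one_hot_row]

row = smooth([0, 0, 1])
print([round(v, 4) for v in row])  # [0.0333, 0.0333, 0.9333]
print(round(sum(row), 6))          # 1.0
```

With K = 3 and epsilon = 0.1, each 0 becomes 0.1 / 3 and the 1 becomes 0.9 + 0.1 / 3, and the row still sums to 1.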

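That is all of modules.py. It is the core of the implementation, and although it is fairly involved, it can be understood by reading it line by line. Next comes train.py, which shows how the network is wired together and trained.

First, the imports: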

from __future__ import print_function
import tensorflow as tf

from hyperparams import Hyperparams as hp
from data_load import get_batch_data, load_de_vocab, load_en_vocab
from modules import *
import os, codecs
from tqdm import tqdm
This imports the files described above, plus a module called tqdm, which is used to display a progress bar during training.

class Graph():
    def __init__(self, is_training=True):
        self.graph = tf.Graph()
        with self.graph.as_default():
            if is_training:
                self.x, self.y, self.num_batch = get_batch_data() # (N, T)
            else: # inference
                self.x = tf.placeholder(tf.int32, shape=(None, hp.maxlen))
                self.y = tf.placeholder(tf.int32, shape=(None, hp.maxlen))

            # define decoder inputs
            self.decoder_inputs = tf.concat((tf.ones_like(self.y[:, :1])*2, self.y[:, :-1]), -1) # 2:<S>

            # Load vocabulary
            de2idx, idx2de = load_de_vocab()
            en2idx, idx2en = load_en_vocab()

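The code above defines a Graph class that makes it convenient to build the TensorFlow graph. Every node and operation defined afterwards goes into this graph, which is set as the default.
First the training or test data is loaded. For training, get_batch_data() from data_load.py supplies the training data and the number of batches. For inference, the data is declared as placeholders to be fed later.
self.x and self.y both have shape [N, T].
self.y is then used to build the decoder input. Compared with self.y, decoder_inputs drops the last column (the end-of-sentence token for a full-length sentence, padding otherwise) and prepends an id of 2, i.e. <S>, which marks the start of a sentence. Its shape is [N, T], the same as self.y.
The word/id dictionaries for German and English are loaded with the functions from data_load.py.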
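The shift that turns self.y into the decoder input can be written out in plain Python. This sketch uses nested lists instead of tensors; shift_right is a hypothetical helper and the ids are made up, with 2, 3 and 0 standing for <S>, </S> and <PAD> as in the vocabulary files.

```python
START_ID = 2  # id of <S>

def shift_right(y_batch, start_id=START_ID):
    """Prepend start_id to each row and drop the last column."""
    return [[start_id] + row[:-1] for row in y_batch]

y = [[7, 8, 3, 0, 0],      # 3 is </S>, 0 is <PAD>
     [5, 6, 9, 4, 3]]
decoder_inputs = shift_right(y)
print(decoder_inputs)  # [[2, 7, 8, 3, 0], [2, 5, 6, 9, 4]]
```

Position j of the decoder input holds the label of position j-1, so the decoder learns to predict each word from the words before it.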

Continuing with the code:

            # Encoder
            with tf.variable_scope("encoder"):
                ## Embedding
                self.enc = embedding(self.x,
                                      vocab_size=len(de2idx),
                                      num_units=hp.hidden_units,
                                      scale=True,
                                      scope="enc_embed")

                ## Positional Encoding
                self.enc += embedding(tf.tile(tf.expand_dims(tf.range(tf.shape(self.x)[1]), 0), [tf.shape(self.x)[0], 1]),
                                      vocab_size=hp.maxlen,
                                      num_units=hp.hidden_units,
                                      zero_pad=False,
                                      scale=False,
                                      scope="enc_pe")

                ## Dropout
                self.enc = tf.layers.dropout(self.enc,
                                            rate=hp.dropout_rate,
                                            training=tf.convert_to_tensor(is_training))

                ## Blocks
                for i in range(hp.num_blocks):
                    with tf.variable_scope("num_blocks_{}".format(i)):
                        ### Multihead Attention
                        self.enc = multihead_attention(queries=self.enc,
                                                        keys=self.enc,
                                                        num_units=hp.hidden_units,
                                                        num_heads=hp.num_heads,
                                                        dropout_rate=hp.dropout_rate,
                                                        is_training=is_training,
                                                        causality=False)

                        ### Feed Forward
                        self.enc = feedforward(self.enc, num_units=[4*hp.hidden_units, hp.hidden_units])

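This block looks long but is straightforward. It defines the structure of the encoder, mostly using functions already covered in modules.py.
First, the embedding function is applied to the input self.x. The resulting self.enc has shape [N, T, hp.hidden_units]. This is only the word embedding. To preserve word-order information, a position embedding is needed as well. This implementation uses a simple positional embedding, which differs somewhat from the description in the paper, although the paper's authors say that either works.
The positional embedding also uses the embedding function, except that the values along the second dimension of its input are not word ids but position ids, of which there are only maxlen. The position ids are generated with tf.range and then tiled across all sentences in the batch, because the position ids are the same for every sentence. Adding the word embedding and the positional embedding gives the final encoder input self.enc, whose shape is still [N, T, hp.hidden_units].
Dropout is then applied to this input; it is only active during training.
Finally the input is passed through the blocks. Following the paper, there are 6 such blocks by default, so the loop runs 6 times. Each block calls multihead_attention once and feedforward once. In the encoder, the queries and keys of multihead_attention are both self.enc, so this is self-attention. The attention output is passed to feedforward, and the result is assigned back to self.enc as the output of the block.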
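The nested tf.tile / tf.expand_dims / tf.range expression is easier to read once you see what it produces. The sketch below builds the same matrix of position ids with the NumPy counterparts of those calls; N and T are toy values chosen for illustration.

```python
import numpy as np

N, T = 3, 5  # toy batch size and sentence length

# NumPy counterpart of tf.tile(tf.expand_dims(tf.range(T), 0), [N, 1])
position_ids = np.tile(np.expand_dims(np.arange(T), 0), [N, 1])
print(position_ids.shape)  # (3, 5)
print(position_ids[0])     # [0 1 2 3 4]
```

Every row is the same sequence 0..T-1, so looking these ids up in an embedding table of size maxlen gives each position its own vector regardless of which sentence it belongs to.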

Next is the decoder.

            # Decoder
            with tf.variable_scope("decoder"):
                ## Embedding
                self.dec = embedding(self.decoder_inputs,
                                      vocab_size=len(en2idx),
                                      num_units=hp.hidden_units,
                                      scale=True,
                                      scope="dec_embed")

                ## Positional Encoding
                self.dec += embedding(tf.tile(tf.expand_dims(tf.range(tf.shape(self.decoder_inputs)[1]), 0), [tf.shape(self.decoder_inputs)[0], 1]),
                                      vocab_size=hp.maxlen,
                                      num_units=hp.hidden_units,
                                      zero_pad=False,
                                      scale=False,
                                      scope="dec_pe")

                ## Dropout
                self.dec = tf.layers.dropout(self.dec,
                                            rate=hp.dropout_rate,
                                            training=tf.convert_to_tensor(is_training))

                ## Blocks
                for i in range(hp.num_blocks):
                    with tf.variable_scope("num_blocks_{}".format(i)):
                        ## Multihead Attention ( self-attention)
                        self.dec = multihead_attention(queries=self.dec,
                                                        keys=self.dec,
                                                        num_units=hp.hidden_units,
                                                        num_heads=hp.num_heads,
                                                        dropout_rate=hp.dropout_rate,
                                                        is_training=is_training,
                                                        causality=True,
                                                        scope="self_attention")

                        ## Multihead Attention ( vanilla attention)
                        self.dec = multihead_attention(queries=self.dec,
                                                        keys=self.enc,
                                                        num_units=hp.hidden_units,
                                                        num_heads=hp.num_heads,
                                                        dropout_rate=hp.dropout_rate,
                                                        is_training=is_training,
                                                        causality=False,
                                                        scope="vanilla_attention")

                        ## Feed Forward
                        self.dec = feedforward(self.dec, num_units=[4*hp.hidden_units, hp.hidden_units])

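This is the decoder. Like the encoder, the decoder starts with a word embedding plus a positional embedding followed by dropout. The result is self.dec, with shape [N, T, hp.hidden_units]. Since this matches the encoder, I will not repeat the details.
Next come the blocks. Whereas an encoder block has a single self-attention, a decoder block has two attention layers. The first is a self-attention, but unlike the encoder's it must not use information from later query positions, so the causality argument of multihead_attention is set to True to mask out the future.
The decoder self-attention is followed by an attention whose keys are the encoder output, which is what connects the encoder and the decoder. Here causality is False, because all of the encoder's information may be used.
Then comes a feedforward layer, the same as in the encoder. There are again six such blocks.
The final decoder output is self.dec, with shape [N, T, hp.hidden_units].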
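To make "masking out the future" concrete, here is one common way to build a causal mask, sketched in NumPy. It is an illustration of the idea only and is not claimed to match modules.py line for line; causal_softmax is a hypothetical helper and the all-zero scores are toy input.

```python
import numpy as np

def causal_softmax(scores):
    """Softmax over keys where query position i only sees key positions <= i.

    scores: square array of shape (T, T), as in self-attention.
    """
    T = scores.shape[-1]
    allowed = np.tril(np.ones((T, T), dtype=bool))   # lower triangle incl. diagonal
    masked = np.where(allowed, scores, -1e9)         # future positions get a huge negative score
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

weights = causal_softmax(np.zeros((4, 4)))
print(np.round(weights, 2))
```

With equal scores, the first query attends only to itself and the second splits its weight evenly over the first two positions; everything above the diagonal is zero, which is the effect causality=True is meant to achieve.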

The remaining code:

            # Final linear projection
            self.logits = tf.layers.dense(self.dec, len(en2idx))
            self.preds = tf.to_int32(tf.arg_max(self.logits, dimension=-1))
            self.istarget = tf.to_float(tf.not_equal(self.y, 0))
            self.acc = tf.reduce_sum(tf.to_float(tf.equal(self.preds, self.y))*self.istarget)/ (tf.reduce_sum(self.istarget))
            tf.summary.scalar('acc', self.acc)

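First a fully connected layer turns the decoder output into a tensor of shape [N, T, len(en2idx)], which is self.logits. The index of the largest value along the last dimension of logits (the index of the predicted word) is then cast to int32 to give self.preds, with shape [N, T]. In parallel, self.istarget is a float tensor of shape [N, T] that is 1.0 wherever the label self.y has a nonzero id (a real word rather than padding) and 0.0 elsewhere.
Next comes self.acc, a tensor that measures accuracy. At each position, the comparison of self.preds and self.y gives 1.0 if they are equal and 0.0 otherwise; multiplying by self.istarget keeps only the target positions, and the sum divided by the number of target positions is the accuracy. self.acc is added to the summary so that training can be monitored.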
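The masked accuracy can be checked by hand on a tiny example. The sketch below uses nested lists in place of tensors; masked_accuracy is a hypothetical helper and the ids are made up.

```python
def masked_accuracy(preds, labels, pad_id=0):
    """Fraction of non-pad label positions where the prediction matches."""
    correct, total = 0.0, 0.0
    for pred_row, label_row in zip(preds, labels):
        for p, y in zip(pred_row, label_row):
            is_target = 1.0 if y != pad_id else 0.0   # plays the role of istarget
            correct += (1.0 if p == y else 0.0) * is_target
            total += is_target
    return correct / total

labels = [[7, 8, 3, 0, 0], [5, 6, 9, 4, 3]]
preds  = [[7, 1, 3, 0, 9], [5, 6, 9, 4, 3]]
acc = masked_accuracy(preds, labels)
print(acc)  # 0.875
```

There are 8 target positions (3 in the first row, 5 in the second) and 7 of them are predicted correctly; whatever is predicted at the two padding positions does not affect the result.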
Continuing:

            if is_training:
                # Loss
                self.y_smoothed = label_smoothing(tf.one_hot(self.y, depth=len(en2idx)))
                self.loss = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=self.y_smoothed)
                self.mean_loss = tf.reduce_sum(self.loss*self.istarget) / (tf.reduce_sum(self.istarget))

                # Training Scheme
                self.global_step = tf.Variable(0, name='global_step', trainable=False)
                self.optimizer = tf.train.AdamOptimizer(learning_rate=hp.lr, beta1=0.9, beta2=0.98, epsilon=1e-8)
                self.train_op = self.optimizer.minimize(self.mean_loss, global_step=self.global_step)

                # Summary
                tf.summary.scalar('mean_loss', self.mean_loss)
                self.merged = tf.summary.merge_all()

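This code is only needed during training. It defines the quantities used by the training procedure.
First the labels are smoothed: self.y is converted to one-hot form and passed through the label_smoothing function from modules.py. The smoothed values are then used as labels, together with the logits from above, in tf.nn.softmax_cross_entropy_with_logits, which computes the cross entropy used as the training loss. At this point loss has shape [N, T]. It still includes the loss at padding positions, which is meaningless, so self.loss*self.istarget removes it. Summing what remains and dividing by the number of target positions gives self.mean_loss.
Then global_step is defined, the optimizer is chosen, and train_op is defined. mean_loss is also added to the summary for tracking.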

The last part of train.py trains the model and saves it:

if __name__ == '__main__':
    # Load vocabulary
    de2idx, idx2de = load_de_vocab()
    en2idx, idx2en = load_en_vocab()

    # Construct graph
    g = Graph("train"); print("Graph loaded")

    # Start session
    sv = tf.train.Supervisor(graph=g.graph,
                             logdir=hp.logdir,
                             save_model_secs=0)
    with sv.managed_session() as sess:
        for epoch in range(1, hp.num_epochs+1):
            if sv.should_stop(): break
            for step in tqdm(range(g.num_batch), total=g.num_batch, ncols=70, leave=False, unit='b'):
                sess.run(g.train_op)

            gs = sess.run(g.global_step)
            sv.saver.save(sess, hp.logdir + '/model_epoch_%02d_gs_%d' % (epoch, gs))

    print("Done")

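First the vocabularies for both languages are loaded, and then the graph object defined above is constructed. A Supervisor sv is created to supervise the long-running training, so both training and saving use the session managed by sv. Training runs for num_epochs epochs; each epoch executes train_op num_batch times and then saves a checkpoint. tqdm is used to display the progress bar; for its usage, see the post on Python's tqdm module. It is easy to use.

That is all of the code in train.py.

The last remaining file is eval.py, which evaluates the model.
First the required modules are loaded: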

from __future__ import print_function
import codecs
import os

import tensorflow as tf
import numpy as np

from hyperparams import Hyperparams as hp
from data_load import load_test_data, load_de_vocab, load_en_vocab
from train import Graph
from nltk.translate.bleu_score import corpus_bleu
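This loads the required packages and the modules defined earlier. The last import is the nltk helper for computing the BLEU score of a translation; I will come back to it when it is used.
Next the test data is loaded.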

def eval():
    # Load graph
    g = Graph(is_training=False)
    print("Graph loaded")

    # Load data
    X, Sources, Targets = load_test_data()
    de2idx, idx2de = load_de_vocab()
    en2idx, idx2en = load_en_vocab()

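First the graph defined earlier is built with is_training set to False. Then load_test_data loads the test data, and the word/id dictionaries of both languages are loaded.

Next the trained model is restored: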

    # Start session
    with g.graph.as_default():
        sv = tf.train.Supervisor()
        with sv.managed_session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
            ## Restore parameters
            sv.saver.restore(sess, tf.train.latest_checkpoint(hp.logdir))
            print("Restored!")

            ## Get model name
            mname = open(hp.logdir + '/checkpoint', 'r').read().split('"')[1] # model name

The restored model is then used to translate the test data:

            ## Inference
            if not os.path.exists('results'): os.mkdir('results')
            with codecs.open("results/" + mname, "w", "utf-8") as fout:
                list_of_refs, hypotheses = [], []
                for i in range(len(X) // hp.batch_size):

                    ### Get mini-batches
                    x = X[i*hp.batch_size: (i+1)*hp.batch_size]
                    sources = Sources[i*hp.batch_size: (i+1)*hp.batch_size]
                    targets = Targets[i*hp.batch_size: (i+1)*hp.batch_size]

                    ### Autoregressive inference
                    preds = np.zeros((hp.batch_size, hp.maxlen), np.int32)
                    for j in range(hp.maxlen):
                        _preds = sess.run(g.preds, {g.x: x, g.y: preds})
                        preds[:, j] = _preds[:, j]

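The loop runs once per full batch in the data (a final incomplete batch is skipped because of the integer division). In each iteration a mini-batch x is taken, and the original sentences of that batch in both languages are kept in sources and targets. Then the translation is carried out. The translation result is initialized as an int32 array of zeros with shape [hp.batch_size, hp.maxlen]. The sentences of the batch are then predicted one word at a time, starting from the first word, so that each word can use the previously predicted words when it is decoded. The loop therefore runs hp.maxlen times, and each iteration feeds the translation so far to the decoder and fixes one more word. Note that a sentence is not translated in a single pass.

When the loop ends, the translations of the batch are in preds.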
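The decoding loop can be imitated without TensorFlow to show why it needs hp.maxlen passes. In the sketch below, toy_preds is a hypothetical stand-in for sess.run(g.preds, ...) whose "prediction" at each position is just the previous token plus the source token, chosen only so that each step visibly depends on the step before; the input ids are made up.

```python
START_ID = 2  # id of <S>

def toy_preds(x, y):
    """Hypothetical stand-in for sess.run(g.preds, {g.x: x, g.y: y}).

    The decoder input is y shifted right with <S> in front; the "prediction"
    at position j is decoder_input[j] + x[j], so it depends on the token
    generated at position j-1.
    """
    out = []
    for x_row, y_row in zip(x, y):
        dec_in = [START_ID] + y_row[:-1]
        out.append([d + s for d, s in zip(dec_in, x_row)])
    return out

def greedy_decode(x, maxlen):
    preds = [[0] * maxlen for _ in x]        # all <PAD> to begin with
    for j in range(maxlen):
        _preds = toy_preds(x, preds)         # one full forward pass per step
        for row, new_row in zip(preds, _preds):
            row[j] = new_row[j]              # keep only the newly decided position
    return preds

decoded = greedy_decode([[1, 2, 1, 2]], maxlen=4)
print(decoded)  # [[3, 5, 6, 8]]
```

At step j only column j of the model output is trusted, because the later columns were computed from placeholder zeros; those are overwritten in later steps once the tokens before them are known.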

Once translated, the results are written to a file:

                    ### Write to file
                    for source, target, pred in zip(sources, targets, preds): # sentence-wise
                        got = " ".join(idx2en[idx] for idx in pred).split("</S>")[0].strip()
                        fout.write("- source: " + source +"\n")
                        fout.write("- expected: " + target + "\n")
                        fout.write("- got: " + got + "\n\n")
                        fout.flush()

                        # bleu score
                        ref = target.split()
                        hypothesis = got.split()
                        if len(ref) > 3 and len(hypothesis) > 3:
                            list_of_refs.append([ref])
                            hypotheses.append(hypothesis)

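preds still holds ids, so they have to be converted to words before being written to the file.
For each sentence in sources, targets and preds, the following is done:
Each id in pred (one sentence of preds) is converted to its English word, and the words are joined into one string with single spaces (this is what join does). Everything from the end-of-sentence token </S> onward is then cut off, which leaves the translated sentence as words.
The source sentence, the expected translation, and the actual translation are written to the file.
The expected translation is split into a list as ref, and the model's translation is split into a list as hypothesis.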
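The id-to-sentence conversion is a one-liner worth trying in isolation. This sketch uses a tiny invented vocabulary; ids_to_sentence is a hypothetical helper that wraps the same join / split / strip chain.

```python
# Toy vocabulary, invented for illustration; ids 0..3 are the special tokens.
idx2en = {0: "<PAD>", 1: "<UNK>", 2: "<S>", 3: "</S>", 4: "thank", 5: "you"}

def ids_to_sentence(pred, idx2word):
    """Join the words for each id and cut everything from the first </S> on."""
    return " ".join(idx2word[idx] for idx in pred).split("</S>")[0].strip()

got = ids_to_sentence([4, 5, 3, 0, 0], idx2en)
print(got)          # thank you
print(got.split())  # ['thank', 'you']
```

Cutting at the first </S> also discards the padding that follows it, and strip() removes the space left in front of the cut.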

Finally the BLEU score is computed and written to the file:

                ## Calculate bleu score
                score = corpus_bleu(list_of_refs, hypotheses)
                fout.write("Bleu Score = " + str(100*score))
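Only sentence pairs in which both ref and hypothesis have more than 3 words are added to the lists passed to corpus_bleu. The resulting BLEU score, which can be used to evaluate the model, is written at the end of the file.

Finally, the evaluation function is run: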

if __name__ == '__main__':
    eval()
    print("Done")
That completes the analysis of the whole Transformer code.
