CNTK API Tutorial Translation (18) — Many-to-Many Neural Networks for Text Data (Part 2)

(There was too much material in this tutorial to cover in one installment, so it has been split in two: the previous part covered the theory and model creation; this part covers training, testing, and tuning.)

Training

Before we start training, we will define the training wrapper, the greedy decoding wrapper, and the criterion function used to train the model. First, the training wrapper.

def create_model_train(s2smodel):
    # model used in training (history is known from labels)
    # note: the labels must NOT contain the initial <s>
    @C.Function
    def model_train(input, labels): # (input*, labels*) --> (word_logp*)

        # The input to the decoder always starts with the special label sequence start token.
        # Then, use the previous value of the label sequence (for training) or the output (for execution).
        past_labels = C.layers.Delay(initial_state=sentence_start)(labels)
        return s2smodel(past_labels, input)
    return model_train

Above we again used the @C.Function decorator to create a CNTK Function object, model_train. This function takes the input sequence input and the output sequence labels as arguments. The past_labels variable uses a Delay layer to hold the history for our previously created model: it returns its input labels delayed by one time step. So if we pass in the labels ['a', 'b', 'c'], past_labels will contain ['<s>', 'a', 'b'], with the first step filled by the sentence start token. The wrapper then returns the result of calling the model on past_labels and input.
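To see concretely what Delay does, here is a minimal sketch using toy scalar sequences (the real model delays one-hot label vectors, but the mechanics are identical):

import cntk as C
import numpy as np

# Delay shifts a sequence right by one step, filling the first step with initial_state
x = C.sequence.input_variable(1)
delayed = C.layers.Delay(initial_state=-1)(x)
seq = np.array([[1.0], [2.0], [3.0]], dtype=np.float32)
print(delayed.eval({x: [seq]}))  # [[-1.], [1.], [2.]]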

Next, we create the greedy decoding model wrapper:

def create_model_greedy(s2smodel):
    # model used in (greedy) decoding (history is decoder's own output)
    # (input*) --> (word_sequence*)
    @C.Function
    @C.layers.Signature(InputSequence[C.layers.Tensor[input_vocab_dim]])
    def model_greedy(input): 

        # Decoding is an unfold() operation starting from sentence_start.
        # We must transform s2smodel (history*, input* -> word_logp*) into a generator (history* -> output*)
        # which holds 'input' in its closure.
        unfold = C.layers.UnfoldFrom(lambda history: s2smodel(history, input) >> C.hardmax,
                            # stop once sentence_end_index was max-scoring output
                            until_predicate=lambda w: w[...,sentence_end_index],
                            length_increase=length_increase)

        return unfold(initial_state=sentence_start, dynamic_axes_like=input)
    return model_greedy

Above we created a new CNTK Function object, model_greedy, which has only a single argument. That is of course because when we use the model at test time we have no labels: producing them is the model's job. Here we use the UnfoldFrom layer, which runs the model on the history so far and generates the hardmax of its output. That hardmax output then becomes part of the history, and the recurrence keeps running until sentence_end_index is produced. The maximum length of the output sequence is the input length scaled by length_increase.
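Conceptually, UnfoldFrom implements a loop like the following plain-Python sketch (for illustration only; the actual layer builds this as a CNTK recurrence over dynamic axes):

def unfold_greedy(step_fn, initial_state, max_len, is_end):
    history = [initial_state]
    outputs = []
    for _ in range(max_len):
        out = step_fn(history)   # run the model on the history generated so far
        outputs.append(out)
        if is_end(out):          # stop once </s> is the top-scoring token
            break
        history.append(out)      # the hardmax output becomes part of the history
    return outputs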

The last thing to do before training is to define the criterion function for our model:

def create_criterion_function(model):
    @C.Function
    @C.layers.Signature(input=InputSequence[C.layers.Tensor[input_vocab_dim]], 
                        labels=LabelSequence[C.layers.Tensor[label_vocab_dim]])
    def criterion(input, labels):
        # criterion function must drop the <s> from the labels
        # <s> A B C </s> --> A B C </s>
        postprocessed_labels = C.sequence.slice(labels, 1, 0) 
        z = model(input, postprocessed_labels)
        ce = C.cross_entropy_with_softmax(z, postprocessed_labels)
        errs = C.classification_error(z, postprocessed_labels)
        return (ce, errs)

    return criterion

Above we created the criterion function. It drops the sequence start token from the labels, runs the model on the given input and labels, and compares the output against the ground truth. We use cross_entropy_with_softmax as the loss and compute classification_error, the per-word error percentage, as a measure of generation accuracy. The CNTK Function object criterion returns these two values as a tuple, and the Python function create_criterion_function returns that CNTK Function object.
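The slice that drops the <s> token behaves like this minimal sketch with toy scalar values (begin_index=1 and end_index=0 mean "from the second element to the end"):

import cntk as C
import numpy as np

x = C.sequence.input_variable(1)
dropped = C.sequence.slice(x, 1, 0)
seq = np.array([[0.0], [1.0], [2.0], [3.0]], dtype=np.float32)
print(dropped.eval({x: [seq]}))  # [[1.], [2.], [3.]]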

Now we create the training loop:

def train(train_reader, valid_reader, vocab, i2w, s2smodel, max_epochs, epoch_size):

    # create the training wrapper for the s2smodel, as well as the criterion function
    model_train = create_model_train(s2smodel)
    criterion = create_criterion_function(model_train)

    # also wire in a greedy decoder so that we can properly log progress on a validation example
    # This is not used for the actual training process.
    model_greedy = create_model_greedy(s2smodel)

    # Instantiate the trainer object to drive the model training
    minibatch_size = 72
    lr = 0.001 if use_attention else 0.005
    learner = C.fsadagrad(model_train.parameters,
                          lr = C.learning_rate_schedule([lr]*2+[lr/2]*3+[lr/4], C.UnitType.sample, epoch_size),
                          momentum = C.momentum_as_time_constant_schedule(1100),
                          gradient_clipping_threshold_per_sample=2.3,
                          gradient_clipping_with_truncation=True)
    trainer = C.Trainer(None, criterion, learner)

    # Get minibatches of sequences to train with and perform model training
    total_samples = 0
    mbs = 0
    eval_freq = 100

    # print out some useful training information
    C.logging.log_number_of_parameters(model_train) ; print()
    progress_printer = C.logging.ProgressPrinter(freq=30, tag='Training')    

    # a hack to allow us to print sparse vectors
    sparse_to_dense = create_sparse_to_dense(input_vocab_dim)

    for epoch in range(max_epochs):
        while total_samples < (epoch+1) * epoch_size:
            # get next minibatch of training data
            mb_train = train_reader.next_minibatch(minibatch_size)

            # do the training
            trainer.train_minibatch({criterion.arguments[0]: mb_train[train_reader.streams.features], 
                                     criterion.arguments[1]: mb_train[train_reader.streams.labels]})

            # log progress
            progress_printer.update_with_trainer(trainer, with_metric=True)

            # every N MBs evaluate on a test sequence to visually show how we're doing
            if mbs % eval_freq == 0:
                mb_valid = valid_reader.next_minibatch(1)

                # run an eval on the decoder output model (i.e. don't use the groundtruth)
                e = model_greedy(mb_valid[valid_reader.streams.features])
                print(format_sequences(sparse_to_dense(mb_valid[valid_reader.streams.features]), i2w))
                print("->")
                print(format_sequences(e, i2w))

                # visualizing attention window
                if use_attention:
                    debug_attention(model_greedy, mb_valid[valid_reader.streams.features])

            total_samples += mb_train[train_reader.streams.labels].num_samples
            mbs += 1

        # log a summary of the stats for the epoch
        progress_printer.epoch_summary(with_metric=True)

    # done: save the final model
    model_path = "model_%d.cmf" % epoch
    print("Saving final model to '%s'" % model_path)
    s2smodel.save(model_path)
    print("%d epochs complete." % max_epochs)

In the function above we wire in the model for both training and evaluation. Normally the evaluation path would not be needed; we include it because we want to periodically decode a sample sequence during training so we can watch how the model converges.

Then we set up some standard training variables: minibatch_size (the total number of sequence elements per minibatch) and the initial learning rate lr. We instantiate a learner using the fsadagrad algorithm together with a learning_rate_schedule that slowly lowers the learning rate (unrolled in the sketch below), apply gradient clipping to guard against exploding gradients, and finally create the trainer object trainer.
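For reference, here is how that epoch-indexed learning rate list unrolls (a plain-Python sketch; the last entry stays in effect for all remaining epochs):

lr = 0.001
schedule = [lr]*2 + [lr/2]*3 + [lr/4]
print(schedule)  # [0.001, 0.001, 0.0005, 0.0005, 0.0005, 0.00025]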

We use CNTK's ProgressPrinter class to track per-minibatch and per-epoch metric averages, updating every 30 minibatches. Finally, before starting training, we add a sparse_to_dense function so the input sequences can be printed during validation, because the validation data is sparse. That function is below:

# dummy for printing the input sequence below. Currently needed because input is sparse.
def create_sparse_to_dense(input_vocab_dim):
    I = C.Constant(np.eye(input_vocab_dim))
    @C.Function
    @C.layers.Signature(InputSequence[C.layers.SparseTensor[input_vocab_dim]])
    def no_op(input):
        return C.times(input, I)
    return no_op
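The identity-matrix trick works because multiplying a one-hot row by the identity simply copies it into a dense array, as this tiny numpy sketch shows (the vocabulary size of 4 is made up for illustration):

import numpy as np

I = np.eye(4)
one_hot = np.array([0, 0, 1, 0])  # a sparse one-hot token (index 2)
print(one_hot @ I)                # dense copy: [0. 0. 1. 0.]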

Training then proceeds just as with every other CNTK network we have trained before: we request the next minibatch, run a training step, and report progress with progress_printer. What is different from an ordinary network is that we also run a validation version of the network and decode the sequence "ABADI" to see what it predicts.

Another difference is that we can optionally use attention and visualize its window. Calling debug_attention shows the weight the decoder places on each encoder hidden state for every output token it generates. That function in turn needs format_sequences, which prints input/output sequences to the screen. Both are shown below:

# Given a vocab and tensor, print the output
def format_sequences(sequences, i2w):
    return [" ".join([i2w[np.argmax(w)] for w in s]) for s in sequences]

# to help debug the attention window
def debug_attention(model, input):
    q = C.combine([model, model.attention_model.attention_weights])
    #words, p = q(input) # Python 3
    words_p = q(input)
    words = words_p[0]
    p     = words_p[1]
    seq_len = words[0].shape[attention_axis-1]
    # attention_span = 7: the test sentence is 7 tokens long
    span = 7
    # (batch, len, attention_span, 1, vector_dim)
    p_sq = np.squeeze(p[0][:seq_len,:span,0,:])
    opts = np.get_printoptions()
    np.set_printoptions(precision=5)
    print(p_sq)
    np.set_printoptions(**opts)
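For instance, format_sequences turns a sequence of score vectors into a readable string by taking the argmax at every step. A quick sketch with a hypothetical three-token vocabulary:

import numpy as np

i2w_toy = {0: '<s>', 1: 'A', 2: '</s>'}      # hypothetical vocabulary
scores  = np.array([[0.1, 0.8, 0.1],         # argmax -> 'A'
                    [0.1, 0.1, 0.8]])        # argmax -> '</s>'
print(format_sequences([scores], i2w_toy))   # ['A </s>']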

Let's train our network on a small fraction of a full epoch. Specifically, we will run through 25,000 samples (roughly 3% of one full epoch):

model = create_model()
train(train_reader, valid_reader, vocab, i2w, model, max_epochs=1, epoch_size=25000)

The output:

['<s> A B A D I </s>']
->
['O O ~K ~K X X X X ~JH ~JH ~JH']
[[ 0.14327  0.14396  0.14337  0.14305  0.14248  0.1422   0.14166]
 [ 0.14327  0.14395  0.14337  0.14305  0.14248  0.1422   0.14166]
 [ 0.14327  0.14396  0.14337  0.14305  0.14248  0.1422   0.14166]
 [ 0.14328  0.14395  0.14337  0.14305  0.14248  0.1422   0.14166]
 [ 0.14327  0.14395  0.14337  0.14305  0.14248  0.1422   0.14166]
 [ 0.14327  0.14395  0.14337  0.14305  0.14248  0.1422   0.14166]
 [ 0.14327  0.14396  0.14337  0.14305  0.14248  0.1422   0.14166]
 [ 0.14327  0.14396  0.14337  0.14305  0.14248  0.1422   0.14166]
 [ 0.14327  0.14395  0.14337  0.14305  0.14248  0.1422   0.14166]
 [ 0.14327  0.14395  0.14337  0.14305  0.14248  0.1422   0.14166]
 [ 0.14327  0.14396  0.14337  0.14305  0.14248  0.1422   0.14166]]
 Minibatch[   1-  30]: loss = 4.145903 * 1601, metric = 87.32% * 1601;
 Minibatch[  31-  60]: loss = 3.648827 * 1601, metric = 86.45% * 1601;
 Minibatch[  61-  90]: loss = 3.320400 * 1548, metric = 88.44% * 1548;
['<s> A B A D I </s>']
->
['~N ~N </s>']
[[ 0.14276  0.14348  0.14298  0.1428   0.1425   0.14266  0.14281]
 [ 0.14276  0.14348  0.14298  0.14281  0.1425   0.14266  0.14281]
 [ 0.14276  0.14348  0.14298  0.14281  0.1425   0.14266  0.14281]]
 Minibatch[  91- 120]: loss = 3.231915 * 1567, metric = 86.02% * 1567;
 Minibatch[ 121- 150]: loss = 3.212445 * 1580, metric = 83.54% * 1580;
 Minibatch[ 151- 180]: loss = 3.214926 * 1544, metric = 84.26% * 1544;
['<s> A B A D I </s>']
->
['~R ~R ~AH ~AH ~AH </s>']
[[ 0.14293  0.14362  0.14306  0.14283  0.14246  0.14252  0.14259]
 [ 0.14293  0.14362  0.14306  0.14283  0.14246  0.14252  0.14259]
 [ 0.14293  0.14362  0.14306  0.14283  0.14246  0.14252  0.14259]
 [ 0.14293  0.14362  0.14306  0.14283  0.14246  0.14252  0.14259]
 [ 0.14293  0.14362  0.14306  0.14283  0.14246  0.14252  0.14259]
 [ 0.14293  0.14362  0.14306  0.14283  0.14246  0.14252  0.14259]]
 Minibatch[ 181- 210]: loss = 3.144272 * 1565, metric = 82.75% * 1565;
 Minibatch[ 211- 240]: loss = 3.185484 * 1583, metric = 83.20% * 1583;
 Minibatch[ 241- 270]: loss = 3.126284 * 1562, metric = 83.03% * 1562;
 Minibatch[ 271- 300]: loss = 3.150704 * 1551, metric = 83.56% * 1551;
['<s> A B A D I </s>']
->
['~R ~R ~R ~AH </s>']
[[ 0.14318  0.14385  0.14318  0.14286  0.14238  0.1423   0.14224]
 [ 0.14318  0.14385  0.14318  0.14286  0.14238  0.1423   0.14224]
 [ 0.14318  0.14385  0.14318  0.14287  0.14238  0.1423   0.14224]
 [ 0.14318  0.14385  0.14318  0.14287  0.14239  0.1423   0.14224]
 [ 0.14318  0.14385  0.14318  0.14287  0.14239  0.1423   0.14224]]
 Minibatch[ 301- 330]: loss = 3.131863 * 1575, metric = 82.41% * 1575;
 Minibatch[ 331- 360]: loss = 3.095721 * 1569, metric = 82.98% * 1569;
 Minibatch[ 361- 390]: loss = 3.098615 * 1567, metric = 82.32% * 1567;
['<s> A B A D I </s>']
->
['~K ~R ~R ~AH </s>']
[[ 0.14352  0.14416  0.14335  0.14292  0.1423   0.14201  0.14173]
 [ 0.1435   0.14414  0.14335  0.14293  0.14231  0.14202  0.14174]
 [ 0.14351  0.14415  0.14335  0.14293  0.1423   0.14202  0.14174]
 [ 0.14351  0.14415  0.14335  0.14293  0.1423   0.14202  0.14174]
 [ 0.14351  0.14415  0.14335  0.14293  0.1423   0.14202  0.14174]]
 Minibatch[ 391- 420]: loss = 3.115971 * 1601, metric = 81.70% * 1601;
Finished Epoch[1 of 300]: [Training] loss = 3.274279 * 22067, metric = 84.14% * 22067 64.263s (343.4 samples/s);
Saving final model to 'model_0.cmf'
1 epochs complete.

As we can see in the output above, the loss has come down quite a bit, but the output sequences are still some way off from what we expect. Uncomment the code below to train for a full epoch; after just that first epoch you will already see a pretty good letter-to-sound model.

# Uncomment the line below to train the model for a full epoch
#train(train_reader, valid_reader, vocab, i2w, model, max_epochs=1, epoch_size=908241)

Testing the network

Now that we have trained a sequence-to-sequence network for letter-to-sound conversion, there are two important things we want to do with it. First, we need to test its accuracy on held-out data. Then, we want to try it in an interactive environment where we can feed it our own input sequences and see how the model performs. First, let's look at the model's error rate.

At the end of training we saved the model with s2smodel.save(model_path). To test it, we therefore first have to load that model and run some data through it. We load the model and create a reader for the test data. Note that this time we pass False to the create_reader function to signal test mode, in which every sample is used exactly once.

# load the model for epoch 0
model_path = "model_0.cmf"
model = C.Function.load(model_path)

# create a reader pointing at our testing data
test_reader = create_reader(dataPath['testing'], False)

Now we define our test function. We pass in the data reader reader, the trained s2smodel, and the vocabulary map i2w so we can directly compare the model's predictions against the labels in the test set. We loop over the test set with a minibatch size of 512 for speed and track the error rate. Note that below we test per sequence: every token of a generated sequence has to match the corresponding token of the label sequence for the sequence to count as correct.

# This decodes the test set and counts the string error rate.
def evaluate_decoding(reader, s2smodel, i2w):

    # wrap the greedy decoder around the model
    model_decoding = create_model_greedy(s2smodel) 

    progress_printer = C.logging.ProgressPrinter(tag='Evaluation')

    sparse_to_dense = create_sparse_to_dense(input_vocab_dim)

    minibatch_size = 512
    num_total = 0
    num_wrong = 0
    while True:
        mb = reader.next_minibatch(minibatch_size)
        # finish when end of test set reached
        if not mb: 
            break
        e = model_decoding(mb[reader.streams.features])
        outputs = format_sequences(e, i2w)
        labels  = format_sequences(sparse_to_dense(mb[reader.streams.labels]), i2w)
        # prepend sentence start for comparison
        outputs = ["<s> " + output for output in outputs]

        num_total += len(outputs)
        num_wrong += sum([label != output for output, label in zip(outputs, labels)])

    rate = num_wrong / num_total
    print("string error rate of {:.1f}% in {} samples".format(100 * rate, num_total))
    return rate
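Note how strict this per-sequence metric is. A toy illustration with hypothetical phoneme strings:

label  = "<s> ~P ~S ~AY ~K </s>"   # hypothetical ground truth
output = "<s> ~P ~S ~AY ~T </s>"   # a single phoneme differs
print(label != output)             # True: the whole sequence counts as one error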

Now we will use the function above to evaluate the decoding. If you use the model we trained above on only 25,000 samples, you will get a string error rate of 100%, because with so little training there is no way to get every output token right. However, if you uncommented the code above and ran a full epoch of training, you will have a much improved model, with training statistics roughly like these:

Finished Epoch[1 of 300]: [Training] loss = 0.878420 * 799303, metric = 26.23% * 799303 1755.985s (455.2 samples/s);

Now let's evaluate the model on the test set.

# print the string error rate
evaluate_decoding(test_reader, model, i2w)

If you did not train a full epoch, the output above will be 1, meaning a 100% string error rate. If you uncommented the code and trained a full epoch, you should get an output of about 0.569, i.e. a 56.9% error rate, which is actually not bad given that every sample was seen only once. Now let's modify the evaluate_decoding function above to report the per-phoneme error rate instead. That measures errors at a finer granularity, and also makes the task look less hopeless, since under the string metric a sequence with a single wrong token counts as 100% wrong even when everything else is right. Here is the modified code:

# This decodes the test set and counts the phoneme error rate.
def evaluate_decoding(reader, s2smodel, i2w):

    # wrap the greedy decoder around the model
    model_decoding = create_model_greedy(s2smodel)

    progress_printer = C.logging.ProgressPrinter(tag='Evaluation')

    sparse_to_dense = create_sparse_to_dense(input_vocab_dim)

    minibatch_size = 512
    num_total = 0
    num_wrong = 0
    while True:
        mb = reader.next_minibatch(minibatch_size)
        # finish when end of test set reached
        if not mb:
            break
        e = model_decoding(mb[reader.streams.features])
        outputs = format_sequences(e, i2w)
        labels  = format_sequences(sparse_to_dense(mb[reader.streams.labels]), i2w)
        # prepend sentence start for comparison
        outputs = ["<s> " + output for output in outputs]

        for s in range(len(labels)):
            for w in range(len(labels[s])):
                num_total += 1
                # in case the prediction is longer than the label
                if w < len(outputs[s]):
                    if outputs[s][w] != labels[s][w]:
                        num_wrong += 1

    rate = num_wrong / num_total
    print("{:.1f}".format(100 * rate))
    return rate


# print the phoneme error rate
test_reader = create_reader(dataPath['testing'], False)
evaluate_decoding(test_reader, model, i2w)

If you used the model trained for a full epoch, you will get a phoneme error rate of around 10% (about 45% with the less-trained model). Not bad! It means that of the 383,294 phonemes in the test set, our model predicted nearly 90% correctly. Next, let's use the model in an interactive session where we can enter our own input sequences and see how the model predicts their pronunciation. In addition, we will visualize the decoder's attention to see which letters of the input mattered for generating which sounds. Note that the examples below will only perform well if you trained the model for at least one full epoch.

Interactive session

Here we will write an interactive function that makes it easy to interact with the trained model and try out our own input sequences. Note that the results will be very poor if the model was trained for less than a full epoch. The one-epoch model we used above performs reasonably well; if you have the time and patience to train for 30 epochs, it will perform very nicely.

We first need to import a few plotting modules to display the attention weights, and then we define a translate function that takes a numpy array as input and runs the model on it.

# imports required for showing the attention weight heatmap
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

def translate(tokens, model_decoding, vocab, i2w, show_attention=False):

    vdict = {v:i for i,v in enumerate(vocab)}
    try:
        w = [vdict["<s>"]] + [vdict[c] for c in tokens] + [vdict["</s>"]]
    except KeyError:  # some token is not in the vocabulary
        print('Input contains an unexpected token.')
        return []

    # convert to one_hot
    query = C.Value.one_hot([w], len(vdict))
    pred = model_decoding(query)
    # first sequence (we only have one) -> [len, vocab size]
    pred = pred[0]
    if use_attention:
        # attention has extra dimensions
        pred = pred[:,0,0,:]

    # print out translation and stop at the sequence-end tag
    prediction = np.argmax(pred, axis=-1)
    translation = [i2w[i] for i in prediction]

    # show attention window (requires matplotlib, seaborn, and pandas)
    if use_attention and show_attention:    
        q = C.combine([model_decoding.attention_model.attention_weights])
        att_value = q(query)

        # get the attention data up to the length of the output (subset of the full window)
        # -> (len, span)
        att_value = att_value[0][0:len(prediction),0:len(w),0,0]

        # set up the actual words/letters for the heatmap axis labels
        columns = [i2w[ww] for ww in prediction]
        index = [i2w[ww] for ww in w]

        dframe = pd.DataFrame(data=np.fliplr(att_value.T), columns=columns, index=index)
        sns.heatmap(dframe)
        plt.show()

    return translation

The translate function above takes tokens (the list of characters the user typed), model_decoding (the greedy decoding version of our model), vocab (the vocabulary), i2w (the index map for vocab), and show_attention (whether to display the attention vectors).

We first convert the input tokens to one-hot vectors and run them through the model with model_decoding(query). Each prediction is now a probability distribution over the vocabulary, so we take the argmax to get the most probable output token.
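Here is a small sketch of that one-hot conversion step (the indices and vocabulary size are made up for illustration):

import cntk as C

w = [0, 4, 2, 1]                 # hypothetical token indices, starting with <s>
query = C.Value.one_hot([w], 6)  # a batch holding one sequence, vocab size 6
print(query.shape)               # (1, 4, 6): batch x sequence length x vocab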

To visualize the attention window, we use the combine function to turn attention_weights into a CNTK Function object that holds the inputs we want. This way, when we run the function q, its output will be the values of attention_weights. We then do some data manipulation to get these values into the format sns expects and display the heatmap.

Finally, we need to write a user-interaction loop that lets the user enter multiple input sequences:

def interactive_session(s2smodel, vocab, i2w, show_attention=False):

    # wrap the greedy decoder around the model
    model_decoding = create_model_greedy(s2smodel)

    import sys

    print('Enter one or more words to see their phonetic transcription.')
    while True:
        # Testing a prefilled text for routine testing
        if isTest():
            line = "psychology"
        else:    
            line = input("> ")
        if line.lower() == "quit":
            break
        # tokenize. Our task is letter to sound.
        out_line = []
        for word in line.split():
            in_tokens = [c.upper() for c in word]
            out_tokens = translate(in_tokens, model_decoding, vocab, i2w, show_attention=True)
            out_line.extend(out_tokens)
        out_line = [" " if tok == '</s>' else tok[1:] for tok in out_line]
        print("=", " ".join(out_line))
        sys.stdout.flush()
        #If test environment we will test the translation only once
        if isTest():
            break

The function above simply wraps a greedy decoder around our model and then repeatedly asks the user for input, which it passes to our translate function. Once an input has been processed, the attention data is visualized.

interactive_session(model, vocab, i2w, show_attention=True)

[Figure: attention weight heatmap for the decoded sequence]

Note how the attention weights show how important different parts of the input are for generating different output tokens. For tasks like machine translation, where word order between languages often changes because of grammatical differences, it gets interesting to watch the attention window move further away from the diagonal, which is what we mostly see in letter-to-sound conversion, as in the figure above.


