MXNet Official Documentation Tutorial (3): A Character-level Language Model Based on a Multi-layer LSTM

This is another advanced example from MXNet, following the handwritten-digit recognition tutorial covered in the previous post; this one uses an LSTM model. Since I am not deeply familiar with natural language processing, this is a fairly literal translation; for terminology and details, see the relevant literature and the original page: Character-level language models.


This tutorial shows how to train a character-level language model with a multi-layer recurrent neural network. Specifically, we will train a multi-layer LSTM on speeches by US President Obama, so that it can generate text in a similar style.

 

Data Preparation

We first download the dataset and print its first few characters.

import os
import urllib
import zipfile

# download the corpus if needed, unzip it, and preview the first 1000 characters
if not os.path.exists("char_lstm.zip"):
    urllib.urlretrieve("http://data.mxnet.io/data/char_lstm.zip", "char_lstm.zip")
with zipfile.ZipFile("char_lstm.zip", "r") as f:
    f.extractall("./")
with open('obama.txt', 'r') as f:
    print(f.read()[0:1000])

Output:

Call to Renewal Keynote Address Call to Renewal Pt 1Call to Renewal Part 2 TOPIC: Our Past, Our Future & Vision for America June 28, 2006 Call to Renewal' Keynote Address Complete Text Good morning. I appreciate the opportunity to speak here at the Call to Renewal's Building a Covenant for a New America conference. I've had the opportunity to take a look at your Covenant for a New America. It is filled with outstanding policies and prescriptions for much of what ails this country. So I'd like to congratulate you all on the thoughtful presentations you've given so far about poverty and justice in America, and for putting fire under the feet of the political leadership here in Washington.But today I'd like to talk about the connection between religion and politics and perhaps offer some thoughts about how we can sort through some of the often bitter arguments that we've been seeing over the last several years.I do so because, as you all know, we can affirm the importance of povert

Next, we define some functions to preprocess the data.

def read_content(path):
    with open(path) as ins:
        return ins.read()

# Return a dict which maps each char to a unique int id
def build_vocab(path):
    content = list(read_content(path))
    idx = 1  # 0 is left for zero-padding
    the_vocab = {}
    for word in content:
        if len(word) == 0:
            continue
        if word not in the_vocab:
            the_vocab[word] = idx
            idx += 1
    return the_vocab

# Encode a sentence with int ids
def text2id(sentence, the_vocab):
    words = list(sentence)
    return [the_vocab[w] for w in words if len(w) > 0]

# build the char vocabulary from the input
vocab = build_vocab("./obama.txt")
print('vocab size = %d' % (len(vocab)))

Output:

vocab size = 83
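As a quick sanity check, we can encode a short sample string (chosen arbitrarily here, assuming all of its characters occur in obama.txt) and decode it again:

# Hypothetical sanity check: encode a short string, then decode it with a reversed dict
sample = "We the people"
ids = text2id(sample, vocab)
print(ids)                                   # one integer id per character
rev = {v: k for k, v in vocab.items()}
print(''.join(rev[i] for i in ids))          # prints the original string back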

Building the LSTM Model

Now we build the multi-layer LSTM model. The LSTM cell definition is implemented in lstm.py:

import lstm

# Each line contains at most 129 chars.
seq_len = 129
# embedding dimension, which maps a character to a 256-dimension vector
num_embed = 256
# number of lstm layers
num_lstm_layer = 3
# hidden units in each LSTM cell
num_hidden = 512

symbol = lstm.lstm_unroll(
    num_lstm_layer,
    seq_len,
    len(vocab) + 1,
    num_hidden=num_hidden,
    num_embed=num_embed,
    num_label=len(vocab) + 1,
    dropout=0.2)
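lstm.py comes with the example code that accompanies this tutorial. Roughly, each unrolled time step is one LSTM cell built from two fully connected projections plus the usual gate nonlinearities. The snippet below is a simplified sketch of such a cell using the old mx.sym API; it is for illustration only and is not the exact code in lstm.py:

# Simplified sketch of a single LSTM step (illustration, not the exact lstm.py code)
import mxnet as mx

def lstm_cell(num_hidden, in_data, prev_c, prev_h,
              i2h_weight, i2h_bias, h2h_weight, h2h_bias):
    # project input and previous hidden state to all four gates at once
    i2h = mx.sym.FullyConnected(data=in_data, weight=i2h_weight, bias=i2h_bias,
                                num_hidden=num_hidden * 4)
    h2h = mx.sym.FullyConnected(data=prev_h, weight=h2h_weight, bias=h2h_bias,
                                num_hidden=num_hidden * 4)
    gates = mx.sym.SliceChannel(i2h + h2h, num_outputs=4)
    in_gate      = mx.sym.Activation(gates[0], act_type="sigmoid")
    in_transform = mx.sym.Activation(gates[1], act_type="tanh")
    forget_gate  = mx.sym.Activation(gates[2], act_type="sigmoid")
    out_gate     = mx.sym.Activation(gates[3], act_type="sigmoid")
    # new cell state and hidden state
    next_c = forget_gate * prev_c + in_gate * in_transform
    next_h = out_gate * mx.sym.Activation(next_c, act_type="tanh")
    return next_c, next_h

lstm_unroll roughly chains seq_len such cells per layer (sharing weights across time steps), with a character embedding in front and a softmax over the vocabulary on top.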

 

Training

First we create a data iterator:

import bucket_io

# The batch size for training
batch_size = 32

# initialize states for LSTM
init_c = [('l%d_init_c' % l, (batch_size, num_hidden)) for l in range(num_lstm_layer)]
init_h = [('l%d_init_h' % l, (batch_size, num_hidden)) for l in range(num_lstm_layer)]
init_states = init_c + init_h

# Even though BucketSentenceIter supports variable-length examples,
# we simply use the fixed-length version here
data_train = bucket_io.BucketSentenceIter(
    "./obama.txt",
    vocab,
    [seq_len],
    batch_size,
    init_states,
    seperate_char='\n',
    text2id=text2id,
    read_content=read_content)

Output:

Summary of dataset ==================
bucket of len 129 : 8290 samples
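Each LSTM layer carries a cell state c and a hidden state h, so the iterator is told about 2 * num_lstm_layer extra inputs, one (batch_size, num_hidden) array per state. Printing init_states makes the naming convention explicit:

print(init_states)
# [('l0_init_c', (32, 512)), ('l1_init_c', (32, 512)), ('l2_init_c', (32, 512)),
#  ('l0_init_h', (32, 512)), ('l1_init_h', (32, 512)), ('l2_init_h', (32, 512))]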

Then we train with the standard model.fit interface:

import mxnet as mx
import numpy as np
import logging
logging.getLogger().setLevel(logging.DEBUG)

# We will show a quick demo with only 1 epoch. In practice, we can set it to 100
num_epoch = 1
# learning rate
learning_rate = 0.01

# Evaluation metric
def Perplexity(label, pred):
    loss = 0.
    for i in range(pred.shape[0]):
        loss += -np.log(max(1e-10, pred[i][int(label[i])]))
    return np.exp(loss / label.size)

model = mx.model.FeedForward(
    ctx=mx.gpu(0),
    symbol=symbol,
    num_epoch=num_epoch,
    learning_rate=learning_rate,
    momentum=0,
    wd=0.0001,
    initializer=mx.init.Xavier(factor_type="in", magnitude=2.34))

model.fit(X=data_train,
          eval_metric=mx.metric.np(Perplexity),
          batch_end_callback=mx.callback.Speedometer(batch_size, 20),
          epoch_end_callback=mx.callback.do_checkpoint("obama"))

Output:

INFO:root:Start training with [gpu(0)]
INFO:root:Epoch[0] Batch [20]   Speed: 36.09 samples/sec    Train-Perplexity=38.167996
INFO:root:Epoch[0] Batch [40]   Speed: 34.29 samples/sec    Train-Perplexity=24.568035
INFO:root:Epoch[0] Batch [60]   Speed: 34.32 samples/sec    Train-Perplexity=23.439121
INFO:root:Epoch[0] Batch [80]   Speed: 34.26 samples/sec    Train-Perplexity=23.209663
INFO:root:Epoch[0] Batch [100]  Speed: 34.28 samples/sec    Train-Perplexity=22.835044
INFO:root:Epoch[0] Batch [120]  Speed: 34.29 samples/sec    Train-Perplexity=22.745794
INFO:root:Epoch[0] Batch [140]  Speed: 34.29 samples/sec    Train-Perplexity=22.500408
INFO:root:Epoch[0] Batch [160]  Speed: 34.23 samples/sec    Train-Perplexity=22.543436
INFO:root:Epoch[0] Batch [180]  Speed: 34.24 samples/sec    Train-Perplexity=22.566656
INFO:root:Epoch[0] Batch [200]  Speed: 34.30 samples/sec    Train-Perplexity=22.378215
INFO:root:Epoch[0] Batch [220]  Speed: 34.31 samples/sec    Train-Perplexity=22.458195
INFO:root:Epoch[0] Batch [240]  Speed: 34.30 samples/sec    Train-Perplexity=22.655659
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=241.197
INFO:root:Saved checkpoint to "obama-0001.params"
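The metric reported above is perplexity: the exponential of the average negative log-likelihood per character, so a value of about 22 means the model is, on average, about as uncertain as a uniform choice over 22 characters. A toy calculation with the Perplexity function defined above:

# Toy example: 3 characters, each given probability 0.5 by the model
label = np.array([1., 2., 0.])
pred = np.zeros((3, 4))
pred[0, 1] = pred[1, 2] = pred[2, 0] = 0.5
print(Perplexity(label, pred))   # exp(mean(-log 0.5)) = 2.0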

Inference

We first define some utility functions to help with inference:

import random
import bisect   # used by the sampling helpers below

from rnn_model import LSTMInferenceModel


# helper structure for prediction
def MakeRevertVocab(vocab):
    dic = {}
    for k, v in vocab.items():
        dic[v] = k
    return dic

# make input from char
def MakeInput(char, vocab, arr):
    idx = vocab[char]
    tmp = np.zeros((1,))
    tmp[0] = idx
    arr[:] = tmp

# helper function for random sampling
def _cdf(weights):
    total = sum(weights)
    result = []
    cumsum = 0
    for w in weights:
        cumsum += w
        result.append(cumsum / total)
    return result

def _choice(population, weights):
    assert len(population) == len(weights)
    cdf_vals = _cdf(weights)
    x = random.random()
    idx = bisect.bisect(cdf_vals, x)
    return population[idx]

# we can use random output, or fixed output by choosing the largest probability
def MakeOutput(prob, vocab, sample=False, temperature=1.):
    if sample == False:
        idx = np.argmax(prob, axis=1)[0]
    else:
        fix_dict = [""] + [vocab[i] for i in range(1, len(vocab) + 1)]
        scale_prob = np.clip(prob, 1e-6, 1 - 1e-6)
        rescale = np.exp(np.log(scale_prob) / temperature)
        rescale[:] /= rescale.sum()
        return _choice(fix_dict, rescale[0, :])
    try:
        char = vocab[idx]
    except KeyError:
        char = ''
    return char

Then we can build the inference model:

import rnn_model

# load from check-point (epoch 75 assumes a model trained for 75 epochs;
# if you only ran the 1-epoch demo above, load checkpoint 1 instead)
_, arg_params, __ = mx.model.load_checkpoint("obama", 75)

# build an inference model
model = rnn_model.LSTMInferenceModel(
    num_lstm_layer,
    len(vocab) + 1,
    num_hidden=num_hidden,
    num_embed=num_embed,
    num_label=len(vocab) + 1,
    arg_params=arg_params,
    ctx=mx.gpu(),
    dropout=0.2)

Now we can generate a 600-character sequence starting with "The United States":

seq_length = 600
input_ndarray = mx.nd.zeros((1,))
revert_vocab = MakeRevertVocab(vocab)
# Feel free to change the starter sentence
output = 'The United States'
random_sample = False
new_sentence = True

ignore_length = len(output)

for i in range(seq_length):
    if i <= ignore_length - 1:
        MakeInput(output[i], vocab, input_ndarray)
    else:
        MakeInput(output[-1], vocab, input_ndarray)
    prob = model.forward(input_ndarray, new_sentence)
    new_sentence = False
    next_char = MakeOutput(prob, revert_vocab, random_sample)
    if next_char == '':
        new_sentence = True
    if i >= ignore_length - 1:
        output += next_char
print(output)

Output:

The United States of America. That's why I'm running for President.The first place we can do better than that they can afford to get the that they can afford to differ on the part of the political settlement. The second part of the problem is that the consequences would have to see the chance to starthe country that we can start by the challenges of the American people. The American people have been talking about how to compete with the streets of San Antonio who are serious about the courage to come together as one people. That the American people have been trying to get there. And they say
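The loop above decodes greedily (random_sample=False), always taking the most probable next character, which is why the sample tends to loop on stock phrases. MakeOutput can also sample from the temperature-scaled distribution. As a convenience (a hypothetical helper, not part of the original tutorial), the generation loop can be wrapped so that both decoding modes are one call away:

# Hypothetical helper wrapping the generation loop above
def generate(starter, length, sample=False, temperature=1.0):
    out = starter
    new_sent = True
    for i in range(length):
        ch = out[i] if i < len(starter) else out[-1]
        MakeInput(ch, vocab, input_ndarray)
        prob = model.forward(input_ndarray, new_sent)
        new_sent = False
        nxt = MakeOutput(prob, revert_vocab, sample, temperature)
        if nxt == '':
            new_sent = True
        if i >= len(starter) - 1:
            out += nxt
    return out

# sampled decoding usually gives more varied (if noisier) text; 0.8 is just an example value
print(generate('The United States', 600, sample=True, temperature=0.8))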

 
