Sequence Models and Long Short-Term Memory Networks: Learning to Use LSTMs in PyTorch

 

English original: the LSTM section of the PyTorch tutorial.

 

None of the feed-forward networks we studied earlier maintains any state between inputs. That may not be the behavior we want. Sequence models are central to NLP: they are models in which there is some kind of dependence between the inputs. The classic example of a sequence model is the Hidden Markov Model for part-of-speech tagging. Another example is the conditional random field.

A recurrent neural network is a network that maintains some kind of state. For example, its output can be used as part of the next input, so that information can propagate along as the network passes over the sequence. In the case of an LSTM, for each element in the sequence there is a corresponding hidden state $h_t$, which in principle can contain information from arbitrary points earlier in the sequence. We can use the hidden state to predict words in a language model, part-of-speech tags, and a myriad of other things.

 

LSTMs in PyTorch

Before getting to the example, note a few things. PyTorch's LSTM expects all of its inputs to be 3D tensors. The semantics of the axes of these tensors is important. The first axis is the sequence itself, the second indexes instances in the mini-batch, and the third indexes elements of the input. We haven't discussed mini-batching yet, so let's ignore it and assume we will always just have one dimension (of size 1) on the second axis. If we want to run the sequence model over the sentence "The cow jumped", our input should look like this:
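
With $q_w$ denoting the vector for word $w$, the sequence axis stacks the word vectors roughly like this:

$$\begin{bmatrix} q_\text{The} \\ q_\text{cow} \\ q_\text{jumped} \end{bmatrix}$$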

Except remember there is an additional 2nd dimension with size 1.

In addition, you could go through the sequence one element at a time, in which case the 1st axis will have size 1 too.

Let's see a quick example.

# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)
lstm = nn.LSTM(3, 3)  # Input dim is 3, output dim is 3
inputs = [torch.randn(1, 3) for _ in range(5)]  # make a sequence of length 5

# initialize the hidden state.
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))
for i in inputs:
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden)

# alternatively, we can do the entire sequence all at once.
# the first value returned by LSTM is all of the hidden states throughout
# the sequence. the second is just the most recent hidden state
# (compare the last slice of "out" with "hidden" below, they are the same)
# The reason for this is that:
# "out" will give you access to all hidden states in the sequence
# "hidden" will allow you to continue the sequence and backpropagate,
# by passing it as an argument  to the lstm at a later time
# Add the extra 2nd dimension
# each call to the LSTM returns two outputs: the output sequence and the final (hidden, cell) state
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  # clean out hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)

Output:

Example: An LSTM for Part-of-Speech Tagging

In this section, we will use an LSTM to get part-of-speech tags. We will not use Viterbi or Forward-Backward or anything like that here, but as a (challenging) exercise to the reader, think about how Viterbi could be used after you have seen what is going on.

The model is as follows: let our input sentence be $w_1, \dots, w_M$, where $w_i \in V$, our vocabulary. Also, let $T$ be our tag set, and $y_i$ the tag of word $w_i$. Denote our prediction of the tag of word $w_i$ by $\hat{y}_i$.

This is a structure prediction model, where our output is a sequence $\hat{y}_1, \dots, \hat{y}_M$, with $\hat{y}_i \in T$.

To do the prediction, pass an LSTM over the sentence. Denote the hidden state at timestep $i$ by $h_i$. Also, assign each tag a unique index (like how we had word_to_ix in the word embeddings section). Then our prediction rule for $\hat{y}_i$ is

$$\hat{y}_i = \operatorname{argmax}_j \, \bigl(\log \operatorname{Softmax}(A h_i + b)\bigr)_j$$

That is, take the log softmax of the affine map of the hidden state, and the predicted tag is the tag with the maximum value in this vector. Note that this implies immediately that the dimensionality of the target space of $A$ is $|T|$.
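
As a minimal, self-contained sketch of this prediction rule (the layer `A`, hidden state `h_i`, and dimensions below are made up for illustration; the affine map $A h_i + b$ is realized by nn.Linear):

import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, num_tags = 6, 3               # |T| = 3 tags, as in the example below
A = nn.Linear(hidden_dim, num_tags)       # the affine map A h_i + b
h_i = torch.randn(1, hidden_dim)          # a hypothetical hidden state at timestep i
log_probs = F.log_softmax(A(h_i), dim=1)  # log Softmax(A h_i + b)
y_hat_i = log_probs.argmax(dim=1)         # index j with the maximum value = predicted tag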

 

Prepare the data:

def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

Output:

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}

Create the model:

class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

Train the model:

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
# inspect the model's predictions before training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    print(tag_scores)

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance

        # clear the accumulated gradients
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        # prepare the inputs
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        # run the network
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        # compute the loss and update the parameters
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

# See what the scores are after training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)

    # The sentence is "the dog ate the apple".  i,j corresponds to score for tag j
    # for word i. The predicted tag is the maximum scoring tag.
    # Here, we can see the predicted sequence below is 0 1 2 0 1
    # since 0 is index of the maximum value of row 1,
    # 1 is the index of maximum value of row 2, etc.
    # Which is DET NOUN VERB DET NOUN, the correct sequence!
    print(tag_scores)

Final output:

Exercise:

 

Augmenting the LSTM part-of-speech tagger with character-level features


In the example above, each word had an embedding, which served as the input to our sequence model. Let's augment the word embeddings with a representation derived from the characters of the word. We expect that this should help significantly, since character-level information like affixes has a large bearing on part of speech. For example, words with the affix -ly are almost always tagged as adverbs in English.

To do this, let $c_w$ be the character-level representation of word $w$. Let $x_w$ be the word embedding as before. Then the input to our sequence model is the concatenation of $x_w$ and $c_w$. So if $x_w$ has dimension 5 and $c_w$ dimension 3, then our LSTM should accept an input of dimension 8.

To get the character-level representation, do an LSTM over the characters of a word, and let $c_w$ be the final hidden state of this LSTM.
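
As a quick illustration of the concatenation described above (a sketch only, using random placeholder tensors with the dimensions from the text):

import torch

x_w = torch.randn(1, 5)                    # word embedding of w, dimension 5
c_w = torch.randn(1, 3)                    # character-level representation of w, dimension 3
lstm_input = torch.cat((x_w, c_w), dim=1)  # concatenation -> dimension 8
print(lstm_input.shape)                    # torch.Size([1, 8])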

Hints:

There will be two LSTMs in your new model: the original one that outputs POS tag scores, and a new one that outputs a character-level representation of each word.
To do a sequence model over characters, you will have to embed characters. The character embeddings will be the input to the character LSTM (a minimal sketch of this step is given right after these hints).
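
A minimal sketch of getting $c_w$ from a character LSTM (the alphabet size, dimensions, and example word below are made up for illustration; the full model follows):

import torch
import torch.nn as nn

char_embedding = nn.Embedding(26, 3)   # assume a 26-letter lowercase alphabet, char_emb_dim = 3
char_lstm = nn.LSTM(3, 3)

word = "apple"
idxs = torch.tensor([ord(ch) - ord('a') for ch in word], dtype=torch.long)
emb = char_embedding(idxs).view(len(word), 1, -1)   # (word_len, batch=1, char_emb_dim)
_, (h_n, c_n) = char_lstm(emb)                      # h_n: final hidden state, shape (1, 1, 3)
c_w = h_n.view(-1)                                  # character-level representation of "apple"
print(c_w.shape)                                    # torch.Size([3])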

 

# -*- coding:utf8 -*-
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)


def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

char_to_ix = {}
char_to_ix[' '] = len(char_to_ix)
for sent, _ in training_data:
    for word in sent:
        for char in word:
            if char not in char_to_ix:
                char_to_ix[char] = len(char_to_ix)

print(char_to_ix)
print('len(char_to_ix):',len(char_to_ix))
print(word_to_ix)

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}


class LSTMTagger(nn.Module):
    def __init__(self, word_emb_dim, char_emb_dim, hidden_dim, vocab_size, tagset_size, char_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim
        self.char_emb_dim = char_emb_dim

        self.word_embedding = nn.Embedding(vocab_size, word_emb_dim)
        self.char_embedding = nn.Embedding(char_size, char_emb_dim)
        self.char_lstm = nn.LSTM(char_emb_dim, char_emb_dim)
        # the main LSTM takes the concatenation of word and character embeddings as input
        self.lstm = nn.LSTM(word_emb_dim + char_emb_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence_word, sentence_char, MAX_WORD_LEN):
        # char emb
        sentence_size = sentence_word.size()[0]
        char_emb = self.char_embedding(sentence_char)  # [sentence_size * MAX_WORD_LEN, char_emb_dim]
        # every word is padded/truncated to MAX_WORD_LEN, so this reshape always succeeds
        # resulting shape: [MAX_WORD_LEN, sentence_size, char_emb_dim]
        char_emb = char_emb.view(len(sentence_word), MAX_WORD_LEN, -1).permute(1, 0, 2)

        self.hidden_char = self.initHidden_char(sentence_size)
        char_lstm_out, self.hidden_char = self.char_lstm(char_emb, self.hidden_char)
        char_embeded = char_lstm_out[-1, :, :].view(sentence_size, -1)

        # word emb
        word_embeded = self.word_embedding(sentence_word)

        embeded = torch.cat((word_embeded, char_embeded), dim=1)
        # print('embeded size:\n', embeded.size())
        self.hidden = self.initHidden()
        lstm_out, self.hidden = self.lstm(embeded.view(sentence_size, 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(sentence_size, -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

    def initHidden(self):
        # fresh zero-initialized (hidden, cell) state for the word-level LSTM
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def initHidden_char(self, sentence_size):
        # fresh zero-initialized (hidden, cell) state for the character-level LSTM
        return (torch.zeros(1, sentence_size, self.char_emb_dim),
                torch.zeros(1, sentence_size, self.char_emb_dim))


# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
WORD_EMB_DIM = 6
CHAR_EMB_DIM = 3
HIDDEN_DIM = 6
MAX_WORD_LEN = 8

model = LSTMTagger(WORD_EMB_DIM, CHAR_EMB_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix), len(char_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# before training
print('before training')
sentence_word = prepare_sequence(training_data[0][0], word_to_ix)
sent_chars = []
for w in training_data[0][0]:
    sps = ' ' * (MAX_WORD_LEN - len(w))
    sent_chars.extend(list(sps + w) if len(w) < MAX_WORD_LEN else list(w[:MAX_WORD_LEN]))
sentence_char = prepare_sequence(sent_chars, char_to_ix)

tag_scores = model(sentence_word, sentence_char, MAX_WORD_LEN)
targets = prepare_sequence(training_data[0][1], tag_to_ix)
print(tag_scores)
print('targets:\n', targets)

for epoch in range(300):
    for sentence, tags in training_data:
        model.zero_grad()
        model.hidden = model.initHidden()
        sentence_word = prepare_sequence(sentence, word_to_ix)
        sent_chars = []
        for w in sentence:
            sps = ' ' * (MAX_WORD_LEN - len(w))
            sent_chars.extend(list(sps + w) if len(w) < MAX_WORD_LEN else list(w[:MAX_WORD_LEN]))
        sentence_char = prepare_sequence(sent_chars, char_to_ix)
        # sentence_char = prepare_char(sentence, char_to_ix, max_length=7)

        targets = prepare_sequence(tags, tag_to_ix)
        tag_scores = model(sentence_word, sentence_char, MAX_WORD_LEN)
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

# after training
print('after training')
sentence_word = prepare_sequence(training_data[0][0], word_to_ix)
sent_chars = []
for w in training_data[0][0]:
    sps = ' ' * (MAX_WORD_LEN - len(w))
    sent_chars.extend(list(sps + w) if len(w) < MAX_WORD_LEN else list(w[:MAX_WORD_LEN]))
sentence_char = prepare_sequence(sent_chars, char_to_ix)

tag_scores = model(sentence_word, sentence_char, MAX_WORD_LEN)
targets = prepare_sequence(training_data[0][1], tag_to_ix)
print(tag_scores)
print('targets:\n', targets)


Final output:

 
