Classic Neural Networks in PyTorch, RNN (Part 3): LSTM (simple data) (hand-written LSTM and backpropagation)

hxxjxw

Last modified 2023-01-27 00:19:47

Tags: RNN, MNIST

First published 2020-08-13 11:13:33

Article link: https://blog.csdn.net/hxxjxw/article/details/107968720

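LSTM was proposed in 1997.

LSTM is a special kind of RNN that performs remarkably well. It largely resolves the difficulties of training RNNs, and on almost every kind of problem it does far better than the vanilla RNN.

From the user's point of view an LSTM is used just like a basic RNN: in PyTorch, nn.LSTM takes the same main arguments as nn.RNN (such as input_size and hidden_size). Internally, though, it carries an extra cell state and has four times as many weights per layer.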

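The Long-Term Dependencies Problem

The long-term dependencies problem is the one already described for RNNs: because of vanishing and exploding gradients, an RNN has only short-term memory and no long-term memory.

Sometimes we only need recent information to perform the current task. For example, take a language model that predicts the next word from the previous words. If we try to predict the last word of "the clouds are in the sky", we need no further context; the next word is obviously sky. In such cases the gap between the relevant information and the position being predicted is small, and an RNN can learn to use the earlier information.

But there are also harder cases. Suppose we try to predict the last word of "I grew up in France... I speak fluent French". The recent words suggest that the next word is probably the name of a language, but to work out which language we need the context of France, which appeared far back. The gap between the relevant information and the current position can therefore become very large.

Unfortunately, as this gap grows, an RNN loses the ability to learn to connect information that far apart.

LSTM, however, can handle this problem!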

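LSTM: Long Short-Term Memory Network

LSTM is a special kind of RNN built mainly around three gates (forget, input, and output). It was proposed to address the vanishing gradient problem when training on long sequences; exploding gradients can still occur and are usually handled with gradient clipping.

Its key ideas are:

Gating: a forget gate, an input gate, and an output gate;
Cell state: an RNN propagates only the hidden state, whereas an LSTM also introduces a cell state.

In the LSTM diagram, compared with a traditional RNN unit, an LSTM unit has not only the hidden state h but also the cell state C. The sigmoid layers are called gates (a value of 0 lets nothing through, a value of 1 lets everything through), and information is merged and filtered through element-wise multiplication and addition. Each LSTM unit has two outputs: ht at the bottom of the diagram and ct at the top. The presence of ct is what greatly alleviates the vanishing gradient problem.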
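
Written out as equations, one step of the unit described above looks roughly as follows. This is a sketch in the notation of the NumPy implementation later in this post (gates i, f, o, candidate g, weights W_x and W_h, bias b); the symbol names and the gate ordering follow that code, and other references, PyTorch included, order the gates differently.

```latex
\begin{aligned}
z_t &= x_t W_x + h_{t-1} W_h + b, \qquad z_t = [\, z_t^{i},\ z_t^{f},\ z_t^{o},\ z_t^{g} \,] \\
i_t &= \sigma(z_t^{i}), \quad f_t = \sigma(z_t^{f}), \quad o_t = \sigma(z_t^{o}), \quad g_t = \tanh(z_t^{g}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here \sigma is the logistic sigmoid and \odot is element-wise multiplication; each of the four slices of z_t has width H, the hidden size.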

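The Core of LSTM

The core of an LSTM is the cell state, the horizontal line running across the top of the diagram.

The cell state is a bit like a conveyor belt. It runs along the entire chain with only a few minor linear interactions, so information can easily flow along it unchanged.

The LSTM does have the ability to remove information from, or add information to, the cell state, and this ability comes from structures called gates.

A gate is a way of optionally letting information through. It consists of a sigmoid neural network layer and a pointwise multiplication.

The sigmoid layer outputs numbers between 0 and 1 that describe how much of each component should be let through: 0 means let nothing through, 1 means let everything through.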
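
A minimal NumPy sketch of a single gate, to make the "sigmoid layer plus pointwise multiplication" idea concrete. The input values and gate pre-activations are made up for illustration; in a real LSTM the pre-activations come from learned weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical values, chosen only for illustration.
signal = np.array([2.0, -1.0, 0.5])          # information that wants to pass
gate_logits = np.array([-20.0, 0.0, 20.0])   # pre-activation of the gate layer

gate = sigmoid(gate_logits)   # every entry lies strictly between 0 and 1
passed = gate * signal        # pointwise multiplication

print(gate)    # approximately [0, 0.5, 1]
print(passed)  # approximately [0, -0.5, 0.5]
```

The first component is blocked almost completely, the second is halved, and the third passes almost unchanged.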

The three gates of an LSTM (forget, input, and output) protect and control the cell state.

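LSTM Structure, Layer by Layer

In the LSTM diagram, each line carries an entire vector from the output of one node to the inputs of others. The pink circles are pointwise operations such as vector addition and pointwise multiplication, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, and a line forking denotes its content being copied, with the copies going to different places.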

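Why LSTM Mitigates Vanishing Gradients

In a vanilla RNN there is no direct path from ht-1 to ht: everything has to pass through Whh, so the gradient across k time steps contains a factor of the form Whh^k.

An LSTM, in contrast, has a direct path for its memory (the cell state), somewhat like the skip connections in ResNet.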
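
A rough numerical illustration of the difference, under simplifying assumptions: the tanh derivative in the RNN is ignored (it can only shrink the gradient further), the forget gate is held at a constant 0.99, and `W_hh` is a random stand-in rescaled to spectral norm 0.5. None of these numbers come from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
H, k = 6, 50   # hidden size and number of time steps (arbitrary choices)

# Vanilla RNN path: the gradient through k steps contains W_hh multiplied k times.
W_hh = rng.standard_normal((H, H))
W_hh *= 0.5 / np.linalg.norm(W_hh, 2)   # rescale to spectral norm 0.5
rnn_factor = np.linalg.norm(np.linalg.matrix_power(W_hh, k), 2)

# LSTM cell-state path: c_t = f_t * c_{t-1} + i_t * g_t, so along this path
# d c_t / d c_{t-1} = f_t (element-wise), with no weight matrix in between.
f = np.full(H, 0.99)    # assumed constant forget gate
lstm_factor = f ** k    # product of k forget-gate values, per unit

print(rnn_factor)    # about 1e-15 or smaller
print(lstm_factor)   # about 0.605 for every unit
```

With a forget gate close to 1 the cell-state path keeps a usable gradient over 50 steps, while the repeated matrix product has effectively vanished.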

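In addition, the LSTM gates use sigmoid as their activation function.

When an LSTM network is very deep and uses tanh as its activation function, vanishing gradients can still occur.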

PyTorch LSTM

PyTorch - LSTM (hxxjxw's blog, CSDN)

Hand-Written LSTM and Backpropagation

import numpy as np


def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

def eval_numerical_gradient_array(f, x, df, h=1e-5):
    """
    Evaluate a numeric gradient for a function that accepts a numpy
    array and returns a numpy array.
    """
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index

        oldval = x[ix]
        x[ix] = oldval + h
        pos = f(x).copy()
        x[ix] = oldval - h
        neg = f(x).copy()
        x[ix] = oldval

        grad[ix] = np.sum((pos - neg) * df) / (2 * h)
        it.iternext()
    return grad

def sigmoid(x):
    """
    A numerically stable version of the logistic sigmoid function.
    """
    pos_mask = (x >= 0)
    neg_mask = (x < 0)
    z = np.zeros_like(x)
    z[pos_mask] = np.exp(-x[pos_mask])
    z[neg_mask] = np.exp(x[neg_mask])
    top = np.ones_like(x)
    top[neg_mask] = z[neg_mask]
    return top / (1 + z)

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    """
    Forward pass for a single timestep of an LSTM.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Inputs:
    - x: Input data, of shape (N, D)
    - prev_h: Previous hidden state, of shape (N, H)
    - prev_c: previous cell state, of shape (N, H)
    - Wx: Input-to-hidden weights, of shape (D, 4H)
    - Wh: Hidden-to-hidden weights, of shape (H, 4H)
    - b: Biases, of shape (4H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - next_c: Next cell state, of shape (N, H)
    - cache: Tuple of values needed for backward pass.
    """
    next_h, next_c, cache = None, None, None
    #############################################################################
    # TODO: Implement the forward pass for a single timestep of an LSTM.        #
    # You may want to use the numerically stable sigmoid implementation above.  #
    #############################################################################
    H = prev_h.shape[1]
    z = x.dot(Wx) + prev_h.dot(Wh) + b
    i = sigmoid(z[:, 0:H])
    f = sigmoid(z[:, H:2*H])
    o = sigmoid(z[:, 2*H:3*H])
    g = np.tanh(z[:, 3*H:4*H])
    next_c = f * prev_c + i * g
    next_h = o * np.tanh(next_c)
    cache = (i, f, o, g, next_c, prev_c, prev_h, Wx, Wh, x)
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################

    return next_h, next_c, cache


def lstm_step_backward(dnext_h, dnext_c, cache):
    """
    Backward pass for a single timestep of an LSTM.

    Inputs:
    - dnext_h: Gradients of next hidden state, of shape (N, H)
    - dnext_c: Gradients of next cell state, of shape (N, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data, of shape (N, D)
    - dprev_h: Gradient of previous hidden state, of shape (N, H)
    - dprev_c: Gradient of previous cell state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    dx, dh, dc, dWx, dWh, db = None, None, None, None, None, None
    #############################################################################
    # TODO: Implement the backward pass for a single timestep of an LSTM.       #
    #                                                                           #
    # HINT: For sigmoid and tanh you can compute local derivatives in terms of  #
    # the output value from the nonlinearity.                                   #
    #############################################################################
    (i, f, o, g, next_c, prev_c, prev_h, Wx, Wh, x) = cache
    do = dnext_h * np.tanh(next_c) * o * (1 - o) 
    dh2c = dnext_h * o * (1 - np.tanh(next_c) ** 2)
    di = (dnext_c * g + dh2c * g) * i * (1 - i) 
    df = (dnext_c * prev_c + dh2c * prev_c) * f * (1 - f) 
    dg = (dnext_c * i + dh2c * i) * (1 - g ** 2) 
    dprev_c = dnext_c * f + dh2c * f 
    d = np.hstack((di, df, do, dg)) #(N, 4H)
    
    dWx = x.T.dot(d) #(D,N) * (N,4H)
    dWh = prev_h.T.dot(d) #(H, N) * (N,4H)
    db = np.sum(d, axis=0)
    dprev_h = d.dot(Wh.T) # (N,4H) * (4H,H) =(N,H)
    dx = d.dot(Wx.T) # (N,4H) * (4H,D) = (N,D)
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################

    return dx, dprev_h, dprev_c, dWx, dWh, db


def lstm_forward(x, h0, Wx, Wh, b):
    """
    Forward pass for an LSTM over an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The LSTM uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the LSTM forward, we return the hidden states for all timesteps.

    Note that the initial hidden state is passed as input, but the initial cell
    state is set to zero. Also note that the cell state is not returned; it is
    an internal variable to the LSTM and is not accessed from outside.

    Inputs:
    - x: Input data of shape (N, T, D)
    - h0: Initial hidden state of shape (N, H)
    - Wx: Weights for input-to-hidden connections, of shape (D, 4H)
    - Wh: Weights for hidden-to-hidden connections, of shape (H, 4H)
    - b: Biases of shape (4H,)

    Returns a tuple of:
    - h: Hidden states for all timesteps of all sequences, of shape (N, T, H)
    - cache: Values needed for the backward pass.
    """
    h, cache = None, None
    #############################################################################
    # TODO: Implement the forward pass for an LSTM over an entire timeseries.   #
    # You should use the lstm_step_forward function that you just defined.      #
    #############################################################################
    N, T, D = x.shape
    H = h0.shape[1]
    prev_h = h0
    h = np.zeros((N, T, H))
    prev_c = np.zeros((N, H))
    cache = {}
    for t in range(T):
        xt = x[:, t, :]
        next_h, next_c, cache[t] = lstm_step_forward(xt, prev_h, prev_c, Wx, Wh, b)
        h[:, t, :] = next_h
        prev_h = next_h
        prev_c = next_c
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################

    return h, cache


def lstm_backward(dh, cache):
    """
    Backward pass for an LSTM over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of hidden states, of shape (N, T, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data of shape (N, T, D)
    - dh0: Gradient of initial hidden state of shape (N, H)
    - dWx: Gradient of input-to-hidden weight matrix of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weight matrix of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    dx, dh0, dWx, dWh, db = None, None, None, None, None
    #############################################################################
    # TODO: Implement the backward pass for an LSTM over an entire timeseries.  #
    # You should use the lstm_step_backward function that you just defined.     #
    #############################################################################
    N, T, H = dh.shape
    (i, f, o, g, next_c, prev_c, prev_h, Wx, Wh, x) = cache[T-1]
    D = x.shape[1]
    dprev_h = np.zeros((N, H))
    dx = np.zeros((N, T, D))
    dh0 = np.zeros((N, H))
    dWx= np.zeros((D, 4*H))
    dWh = np.zeros((H, 4*H))
    db = np.zeros((4*H,))
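    dnext_c = np.zeros((N, H))  # no gradient flows into the final cell state, so start at 0
    for t in range(T-1, -1, -1):
        dnext_h = dh[:, t, :] + dprev_h  # upstream gradient (from above) + gradient from step t+1 (from the right, initially 0)
        dx[:, t, :], dprev_h, dprev_c, dWxt, dWht, dbt = lstm_step_backward(dnext_h, dnext_c, cache[t])
        dnext_c = dprev_c
        dWx += dWxt  # the weights are shared across time steps, so their gradients accumulate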
        dWh += dWht
        db += dbt
        dh0 = dprev_h
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################

    return dx, dh0, dWx, dWh, db



if __name__ == '__main__':
    np.random.seed(231)

    N, D, T, H = 2, 3, 10, 6

    x = np.random.randn(N, T, D)
    h0 = np.random.randn(N, H)
    Wx = np.random.randn(D, 4 * H)
    Wh = np.random.randn(H, 4 * H)
    b = np.random.randn(4 * H)

    out, cache = lstm_forward(x, h0, Wx, Wh, b)

    dout = np.random.randn(*out.shape)

    dx, dh0, dWx, dWh, db = lstm_backward(dout, cache)

    fx = lambda x: lstm_forward(x, h0, Wx, Wh, b)[0]
    fh0 = lambda h0: lstm_forward(x, h0, Wx, Wh, b)[0]
    fWx = lambda Wx: lstm_forward(x, h0, Wx, Wh, b)[0]
    fWh = lambda Wh: lstm_forward(x, h0, Wx, Wh, b)[0]
    fb = lambda b: lstm_forward(x, h0, Wx, Wh, b)[0]

    dx_num = eval_numerical_gradient_array(fx, x, dout)
    dh0_num = eval_numerical_gradient_array(fh0, h0, dout)
    dWx_num = eval_numerical_gradient_array(fWx, Wx, dout)
    dWh_num = eval_numerical_gradient_array(fWh, Wh, dout)
    db_num = eval_numerical_gradient_array(fb, b, dout)

    print('dx error: ', rel_error(dx_num, dx))
    print('dh0 error: ', rel_error(dh0_num, dh0))
    print('dWx error: ', rel_error(dWx_num, dWx))
    print('dWh error: ', rel_error(dWh_num, dWh))
    print('db error: ', rel_error(db_num, db))
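
The check above compares the analytic gradients with centered finite differences, (f(x+h) - f(x-h)) / 2h. The same idea on a function whose gradient is known in closed form, as a minimal standalone sketch (the test function `f` and the helper name `numerical_gradient` are illustrative and not part of the script above):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for ix in np.ndindex(x.shape):
        old = x[ix]
        x[ix] = old + h
        pos = f(x)
        x[ix] = old - h
        neg = f(x)
        x[ix] = old                      # restore the original value
        grad[ix] = (pos - neg) / (2 * h)
    return grad

f = lambda x: np.sum(np.tanh(x) ** 2)               # scalar test function
x = np.array([[0.3, -1.2], [2.0, 0.0]])
analytic = 2 * np.tanh(x) * (1 - np.tanh(x) ** 2)   # d/dx of tanh(x)^2
numeric = numerical_gradient(f, x)

print(np.max(np.abs(analytic - numeric)))  # tiny, limited by floating-point precision
```

The script above does the same thing for array-valued outputs by weighting the difference with the upstream gradient `df` before summing.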

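Word Embedding

Word embedding converts words into numeric vectors, because a computer works with numbers, not text.

A word embedding vector can also be understood as the vector representation of a word inside the neural network.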
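
As a minimal sketch, an embedding layer can be pictured as a lookup table with one row per word. The vocabulary, the embedding size, and the random table below are made up for illustration; PyTorch's nn.Embedding, used later in this post, performs the same kind of lookup with a trainable table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and embedding size.
word_to_ix = {"the": 0, "dog": 1, "ate": 2, "apple": 3}
embedding_dim = 4

# The embedding layer is just a (vocab_size, embedding_dim) table of
# trainable numbers; here it is randomly initialised and left untrained.
embedding_table = rng.standard_normal((len(word_to_ix), embedding_dim))

sentence = ["the", "dog", "ate", "the", "apple"]
idxs = np.array([word_to_ix[w] for w in sentence])   # words -> integer indices
vectors = embedding_table[idxs]                      # indices -> rows of the table

print(idxs)           # [0 1 2 0 3]
print(vectors.shape)  # (5, 4)
```

Both occurrences of "the" map to the same row, so they get the same vector.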

padding_idx

When batching in natural language processing, the sentences are not necessarily the same length, so the shorter ones need to be padded.
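
A small sketch of what padding a batch looks like, assuming index 0 is reserved for the padding token (the sentences and the name `PAD_IDX` are made up). In PyTorch, passing that index as `padding_idx` to nn.Embedding means the embedding vector at that index is not updated during training.

```python
import numpy as np

PAD_IDX = 0   # assumed: index 0 is reserved for the padding token

# Hypothetical batch of three sentences already converted to word indices.
batch = [[5, 2, 9, 4], [7, 3], [8, 1, 6]]

max_len = max(len(s) for s in batch)
padded = np.full((len(batch), max_len), PAD_IDX, dtype=np.int64)
for row, sent in enumerate(batch):
    padded[row, :len(sent)] = sent   # copy the real tokens, leave the rest as PAD_IDX

mask = padded != PAD_IDX             # True where there is a real token
print(padded)
# [[5 2 9 4]
#  [7 3 0 0]
#  [8 1 6 0]]
print(mask.sum(axis=1))  # [4 2 3], the original lengths
```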

......

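Part-of-Speech Tagging with an LSTM

The input data are sentences together with a part-of-speech label for each word; given a new sentence, the network should predict the part of speech of each of its words.

When an RNN processes text, the words or characters are always first converted into numeric vectors, and these vectors are what gets fed to the LSTM or similar model; the model never handles characters directly.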

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Define the model
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim)

        # The linear layer that maps from hidden state space to tag space (a fully connected layer)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))  # reshape the 3-D tensor into a 2-D tensor
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

# Data preparation
# seq is a list of tokens (words or tags); to_ix maps each token to its index
# Returns the tokens as a tensor of indices
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


# Each training example is a sentence plus a tag for each of its words
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
word_to_ix = {}
for sent, _ in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

# word_to_ix is:
#{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6


### Train the model and predict
model = LSTMTagger(
    embedding_dim=EMBEDDING_DIM,
    hidden_dim=HIDDEN_DIM,
    vocab_size=len(word_to_ix),
    tagset_size=len(tag_to_ix)
)

loss_function = nn.NLLLoss()    # called as loss(pred, label) with pred of shape (N, C) and label of shape (N,); N = number of words in the sequence, C = number of tag classes
optimizer = optim.SGD(model.parameters(), lr=0.1)

# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    # training_data[0][0] is ['The', 'dog', 'ate', 'the', 'apple']
    # word_to_ix is {'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}
    inputs = prepare_sequence(seq=training_data[0][0], to_ix=word_to_ix)

    # inputs is tensor([0, 1, 2, 3, 4])
    tag_scores = model(inputs)
    print('Scores before training')
    print(tag_scores)


for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:

        model.zero_grad()

        #Get our inputs ready for the network, that is, turn them into Tensors of word indices.
        sentence_in = prepare_sequence(seq=sentence, to_ix=word_to_ix)
        # The part-of-speech tag indices for this sequence
        targets = prepare_sequence(seq=tags, to_ix=tag_to_ix)

        # Run our forward pass.
        tag_scores = model(sentence_in)

        #Compute the loss, gradients, and update the parameters by calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()


# See what the scores are after training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)

    # The sentence is "the dog ate the apple".  i,j corresponds to score for tag j
    # for word i. The predicted tag is the maximum scoring tag.
    # Here, we can see the predicted sequence below is 0 1 2 0 1
    # since 0 is index of the maximum value of row 1,
    # 1 is the index of maximum value of row 2, etc.
    # Which is DET NOUN VERB DET NOUN, the correct sequence!
    print('Scores after training')
    print("[result]tag_scores={}".format(tag_scores))
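
To turn such a score matrix into tag names, take the argmax of each row. A small sketch with made-up log-probabilities (these are not the output of the model above); `ix_to_tag` is a helper introduced here as the inverse of `tag_to_ix`. With a torch tensor the equivalent call is `tag_scores.argmax(dim=1)`.

```python
import numpy as np

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}   # inverse mapping

# Made-up log-probabilities for a 5-word sentence: one row per word,
# one column per tag. These are NOT actual outputs of the trained model.
tag_scores = np.log(np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
]))

pred_ix = tag_scores.argmax(axis=1)                    # best tag index for each word
pred_tags = [ix_to_tag[i] for i in pred_ix.tolist()]   # indices -> tag names
print(pred_ix)     # [0 1 2 0 1]
print(pred_tags)   # ['DET', 'NN', 'V', 'DET', 'NN']
```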

nn.LSTM(input_size, hidden_size)

nn.LSTMCell() is rarely used.
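
A short sketch of the tensor shapes that nn.LSTM expects and returns, with arbitrary sizes chosen for illustration. It uses the default single-layer, unidirectional configuration with `batch_first=False`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, batch, input_size, hidden_size = 7, 2, 3, 5   # arbitrary sizes
lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size)

# With the default batch_first=False the input is (seq_len, batch, input_size).
x = torch.randn(seq_len, batch, input_size)

# If no initial state is given, h_0 and c_0 default to zeros.
out, (h_n, c_n) = lstm(x)

print(out.shape)   # torch.Size([7, 2, 5]): hidden state at every time step
print(h_n.shape)   # torch.Size([1, 2, 5]): hidden state after the last step
print(c_n.shape)   # torch.Size([1, 2, 5]): cell state after the last step

# For a single-layer, unidirectional LSTM the last row of `out` is h_n.
print(torch.allclose(out[-1], h_n[0]))   # True
```

nn.LSTM runs over the whole sequence in one call, whereas nn.LSTMCell processes a single time step per call and leaves the loop to the caller.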

A detailed analysis of LSTM (PyTorch version) - Zhihu

Understanding and using LSTM (with PyTorch as the example), hxshine's blog on CSDN

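nn.NLLLoss()

nn.CrossEntropyLoss, nn.NLLLoss, and nn.BCELoss are all cross-entropy losses.

Applying softmax, then log, then nn.NLLLoss is equivalent to nn.CrossEntropyLoss: log(softmax(x)) + nn.NLLLoss ==> nn.CrossEntropyLoss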
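
The equivalence can be checked numerically. The sketch below re-implements the three pieces in NumPy rather than calling PyTorch; the logits and labels are made up, and the losses are averaged over rows, which matches PyTorch's default `reduction='mean'`.

```python
import numpy as np

def log_softmax(x):
    # Subtract the row maximum first for numerical stability.
    shifted = x - x.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def nll_loss(log_probs, targets):
    # Mean of the negative log-probability assigned to the correct class.
    return -log_probs[np.arange(len(targets)), targets].mean()

def cross_entropy(logits, targets):
    # Cross-entropy computed directly from the definition:
    # -log( exp(logit of target) / sum_j exp(logit_j) ), averaged over rows.
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(targets)), targets]).mean()

# Hypothetical raw scores for 3 words and 3 tag classes, with their labels.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5, 0.3],
                   [-0.5, 0.2, 2.2]])
targets = np.array([0, 1, 2])

a = nll_loss(log_softmax(logits), targets)
b = cross_entropy(logits, targets)
print(a, b)   # the two values agree up to floating-point error
```

This is why the tagger above applies F.log_softmax in forward() and pairs it with nn.NLLLoss: together they compute the cross-entropy loss.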

LSTM principles and practice - Zhihu
