LSTM: Translation and Interpretation of "Long Short-Term Memory"

This article examines the LSTM (Long Short-Term Memory) network, which was designed to overcome the difficulty conventional recurrent networks have with long time lags. Through its special unit structure, LSTM maintains stable error flow, allowing it to learn tasks spanning more than 1000 time steps while avoiding the vanishing and exploding gradients that can arise during error backpropagation. Compared with conventional BPTT and RTRL, LSTM solves complex tasks more reliably and learns much faster.



Contents

Long Short-Term Memory

Abstract

1 INTRODUCTION

2 PREVIOUS WORK

3 CONSTANT ERROR BACKPROP

3.1 EXPONENTIALLY DECAYING ERROR

3.2 CONSTANT ERROR FLOW: NAIVE APPROACH

4 LONG SHORT-TERM MEMORY

5 EXPERIMENTS

5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR


Long Short-Term Memory

Original paper, address 01: https://arxiv.org/pdf/1506.04214.pdf ; address 02: https://www.bioinf.jku.at/publications/older/2604.pdf

Abstract

Learning to store information over extended time intervals via recurrent backpropagation takes a very long time, mostly due to insufficient, decaying error backflow. We briefly review Hochreiter's 1991 analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called "Long Short-Term Memory" (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carrousels" within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long time lag tasks that have never been solved by previous recurrent network algorithms.

1 INTRODUCTION

Recurrent networks can in principle use their feedback connections to store representations of recent input events in form of activations ("short-term memory", as opposed to "long-term memory" embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (e.g., Mozer 1992). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backprop in feedforward nets with limited time windows. This paper will review an analysis of the problem and suggest a remedy.

The problem. With conventional "Back-Propagation Through Time" (BPTT, e.g., Williams and Zipser 1992, Werbos 1988) or "Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside 1987), error signals "flowing backwards in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter 1991). Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all (see Section 3).

The remedy. This paper presents "Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error back-flow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow through internal states of special units (provided the gradient computation is truncated at certain architecture-specific points; this does not affect long-term error flow, though).

Outline of paper. Section 2 will briefly review previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It will then introduce a naive approach to constant error backprop for didactic purposes, and highlight its problems concerning information storage and retrieval. These problems will lead to the LSTM architecture as described in Section 4. Section 5 will present numerous experiments and comparisons with competing methods. LSTM outperforms them, and also learns to solve complex, artificial tasks no other recurrent net algorithm has solved. Section 6 will discuss LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1), and explicit error flow formulae (A.2).
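The exponential dependence of the backpropagated error on the recurrent weights, and the constant-error-carrousel idea used to counter it, can be illustrated numerically. The snippet below is not from the paper; it is a minimal sketch in plain Python, assuming a single self-recurrent unit where `w` is the self-connection weight and `f_prime` stands for the activation derivative (for the logistic sigmoid, at most 0.25).

```python
def backprop_error_scaling(w, f_prime, steps=100):
    """Total scaling factor applied to an error signal propagated back
    over `steps` time steps through a single self-recurrent unit.
    At each step the error is multiplied by f'(net) * w, so the total
    factor is (f'(net) * w) ** steps."""
    factor = 1.0
    for _ in range(steps):
        factor *= f_prime * w
    return factor

# Logistic sigmoid: |f'| <= 0.25, so |w| < 4 makes the error vanish.
print(backprop_error_scaling(w=3.0, f_prime=0.25, steps=100))  # ~3e-13, vanishes
print(backprop_error_scaling(w=5.0, f_prime=0.25, steps=100))  # ~5e9, blows up

# Naive constant error carrousel: identity activation (f' = 1) and
# self-connection weight 1.0 keep the error factor at exactly 1.0.
print(backprop_error_scaling(w=1.0, f_prime=1.0, steps=100))   # 1.0
```

In LSTM, the multiplicative gate units described in the abstract control read and write access to this constant-error path, which is what the naive version above lacks.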


2 PREVIOUS WORK

This section will focus on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and fixpoint-based gradient calculations, e.g., Almeida 1987, Pineda 1987).

Gradient-descent variants. The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter's comprehensive overview (1995) suffer from the same problems as BPTT and RTRL (see Sections 1 and 3).

Time-delays. Other methods that seem practical for short time lags only are Time-Delay Neural Networks (Lang et al. 1990) and Plate's method (Plate 1993), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe 1991). Lin et al. (1995) propose variants of time-delay networks called NARX networks.


Time constants. To deal with long time lags, Mozer (1992) uses time constants influencing changes of unit activations (deVries and Principe's above-mentioned approach (1991) may in fact be viewed as a mixture of TDNN and time constants). For long time lags, however, the time constants need external fine tuning (Mozer 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical.
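A short sketch may make the difference between these update rules concrete. The Python code below is illustrative only, not from the paper: `alpha` plays the role of Mozer's time constant, and the additive rule corresponds to Sun et al.'s update of adding the old activation and the scaled net input; the function names and numerical values are assumptions chosen for illustration.

```python
def time_constant_update(a_old, net_input, alpha=0.95):
    """Leaky integration: the time constant alpha controls how slowly the
    activation decays, but alpha must be tuned to the lag at hand."""
    return alpha * a_old + (1.0 - alpha) * net_input

def additive_update(a_old, net_input, scale=0.1):
    """Sun et al.-style update: add the old activation and the scaled
    current net input; every nonzero input perturbs the stored value."""
    return a_old + scale * net_input

a = 1.0  # value we would like to remember
for t in range(50):
    a = additive_update(a, net_input=0.2)  # ongoing inputs keep shifting it
print(a)  # 2.0: the stored value has drifted, long-term storage is impractical
```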

Ring's approach. Ring (1993) also proposed a method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, to bridge a time lag involving 100 steps may require the addition of 100 units. Also, Ring's net does not generalize to unseen lag durations.

### Research background and development of the bidirectional LSTM (BiLSTM)

The concept of the bidirectional LSTM (Bidirectional LSTM, BiLSTM) traces back to the Long Short-Term Memory network proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997, and to the sequence-modelling work developed further by Graves et al. in 2005 [^1]. Concretely, the core idea of the bidirectional LSTM is to add a reverse processing pass on top of the unidirectional LSTM, so that the model can draw on both past and future information when extracting features [^2].

#### Key references

The following works laid the foundations of the bidirectional LSTM and drove its adoption:

1. **Hochreiter & Schmidhuber (1997)**
   This classic paper first proposed the LSTM architecture and addressed the vanishing/exploding gradient problem of traditional RNNs. Although it does not describe a bidirectional architecture, it provides the theoretical basis for the later developments.

2. **Graves et al. (2005)**
   Alex Graves and his team published "Framewise phoneme classification with bidirectional networks", which brought bidirectional LSTM networks to speech recognition. They showed that combining forward and backward passes yields richer context and significantly improves classification performance [^3].

3. **Schuster & Paliwal (1997)**
   Michael Schuster and Kuldip K. Paliwal introduced the bidirectional recurrent neural network (BRNN) and discussed its use in pattern-recognition tasks. Although their implementation differs from the modern BiLSTM, it is regarded as an important precursor of the idea [^4].

4. **Sutskever et al. (2014)**
   Ilya Sutskever and colleagues demonstrated the power of deep LSTM encoder-decoder frameworks for machine translation, in which the encoder captures an overall semantic representation of the source sentence; bidirectional encoders later became a common choice in such architectures [^5].

Together, these works led to the standard form in use today: two conventional LSTMs operating independently, each with its own parameters, one processing the sequence in normal temporal order (forward pass) and the other in the reverse direction (backward pass), with their outputs combined into a single representation [^6].

In addition, with deep-learning libraries such as TensorFlow and PyTorch now widely available, developers can build their own customized BiLSTM models simply by setting the corresponding hyperparameters [^7]:

```python
import paddle.nn as nn

class BiLSTMModel(nn.Layer):
    def __init__(self, input_size, hidden_size, num_layers=1):
        super(BiLSTMModel, self).__init__()
        # direction='bidirectional' runs a forward and a backward LSTM
        # and concatenates their hidden states at every time step.
        self.lstm = nn.LSTM(input_size=input_size,
                            hidden_size=hidden_size,
                            direction='bidirectional',
                            num_layers=num_layers)

    def forward(self, inputs):
        # inputs:  [batch_size, seq_len, input_size]
        # outputs: [batch_size, seq_len, 2 * hidden_size]
        outputs, _ = self.lstm(inputs)
        return outputs
```

The snippet above defines a basic bidirectional LSTM network class on the PaddlePaddle platform [^8].
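As a quick usage check (not part of the original post), the hypothetical snippet below instantiates the class above and passes a random batch through it, assuming PaddlePaddle is installed; the sizes are arbitrary and only illustrate that the bidirectional output carries 2 * hidden_size features per time step.

```python
import paddle

model = BiLSTMModel(input_size=16, hidden_size=32)
x = paddle.randn([4, 10, 16])   # [batch_size, seq_len, input_size]
y = model(x)
print(y.shape)                  # [4, 10, 64]: 2 * hidden_size features per step
```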