LSTM: Translation and Interpretation of "Long Short-Term Memory"

This article examines the LSTM (Long Short-Term Memory) network, which was designed to overcome the difficulty conventional recurrent networks have with long time lags. Through its special unit structure, LSTM maintains stable error flow, allowing it to learn tasks spanning more than 1000 time steps while avoiding the vanishing and exploding gradients that can arise during error backpropagation. Compared with conventional BPTT and RTRL, LSTM solves complex tasks more reliably and learns much faster.



Contents

Long Short-Term Memory

Abstract

1 INTRODUCTION

2 PREVIOUS WORK

3 CONSTANT ERROR BACKPROP

3.1 EXPONENTIALLY DECAYING ERROR

3.2 CONSTANT ERROR FLOW: NAIVE APPROACH

4 LONG SHORT-TERM MEMORY

5 EXPERIMENTS

5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR


Long Short-Term Memory

Original paper, address 01: https://arxiv.org/pdf/1506.04214.pdf ; address 02: https://www.bioinf.jku.at/publications/older/2604.pdf

Abstract

Learning to store information over extended time intervals via recurrent backpropagation takes a very long time, mostly due to insufficient, decaying error backflow. We briefly review Hochreiter's 1991 analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called "Long Short-Term Memory" (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carrousels" within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long time lag tasks that have never been solved by previous recurrent network algorithms.

1 INTRODUCTION

Recurrent networks can in principle use their feedback connections to store representations of recent input events in form of activations ("short-term memory", as opposed to "long-term memory" embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (e.g., Mozer 1992). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backprop in feedforward nets with limited time windows. This paper will review an analysis of the problem and suggest a remedy.

The problem. With conventional "Back-Propagation Through Time" (BPTT, e.g., Williams and Zipser 1992, Werbos 1988) or "Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside 1987), error signals "flowing backwards in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter 1991). Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all (see Section 3).

The remedy. This paper presents "Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error back-flow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow through internal states of special units (provided the gradient computation is truncated at certain architecture-specific points; this does not affect long-term error flow, though).

Outline of paper. Section 2 will briefly review previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It will then introduce a naive approach to constant error backprop for didactic purposes, and highlight its problems concerning information storage and retrieval. These problems will lead to the LSTM architecture as described in Section 4. Section 5 will present numerous experiments and comparisons with competing methods. LSTM outperforms them, and also learns to solve complex, artificial tasks no other recurrent net algorithm has solved. Section 6 will discuss LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1), and explicit error flow formulae (A.2).
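The exponential dependence of the backpropagated error on the recurrent weights, and the constant-error-carrousel idea used to counter it, can be illustrated numerically. The snippet below is not from the paper; it is a minimal sketch in plain Python, assuming a single self-recurrent unit where `w` is the self-connection weight and `f_prime` stands for the activation derivative (for the logistic sigmoid, at most 0.25).

```python
def backprop_error_scaling(w, f_prime, steps=100):
    """Total scaling factor applied to an error signal propagated back
    over `steps` time steps through a single self-recurrent unit.
    At each step the error is multiplied by f'(net) * w, so the total
    factor is (f'(net) * w) ** steps."""
    factor = 1.0
    for _ in range(steps):
        factor *= f_prime * w
    return factor

# Logistic sigmoid: |f'| <= 0.25, so |w| < 4 makes the error vanish.
print(backprop_error_scaling(w=3.0, f_prime=0.25, steps=100))  # ~3e-13, vanishes
print(backprop_error_scaling(w=5.0, f_prime=0.25, steps=100))  # ~5e9, blows up

# Naive constant error carrousel: identity activation (f' = 1) and
# self-connection weight 1.0 keep the error factor at exactly 1.0.
print(backprop_error_scaling(w=1.0, f_prime=1.0, steps=100))   # 1.0
```

In LSTM, the multiplicative gate units described in the abstract control read and write access to this constant-error path, which is what the naive version above lacks.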


2 PREVIOUS WORK

This section will focus on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and fixpoint-based gradient calculations, e.g., Almeida 1987, Pineda 1987).

Gradient-descent variants. The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter's comprehensive overview (1995) suffer from the same problems as BPTT and RTRL (see Sections 1 and 3).

Time-delays. Other methods that seem practical for short time lags only are Time-Delay Neural Networks (Lang et al. 1990) and Plate's method (Plate 1993), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe 1991). Lin et al. (1995) propose variants of time-delay networks called NARX networks.


Time constants. To deal with long time lags, Mozer (1992) uses time constants influencing changes of unit activations (deVries and Principe's above-mentioned approach (1991) may in fact be viewed as a mixture of TDNN and time constants). For long time lags, however, the time constants need external fine tuning (Mozer 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical.
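A short sketch may make the difference between these update rules concrete. The Python code below is illustrative only, not from the paper: `alpha` plays the role of Mozer's time constant, and the additive rule corresponds to Sun et al.'s update of adding the old activation and the scaled net input; the function names and numerical values are assumptions chosen for illustration.

```python
def time_constant_update(a_old, net_input, alpha=0.95):
    """Leaky integration: the time constant alpha controls how slowly the
    activation decays, but alpha must be tuned to the lag at hand."""
    return alpha * a_old + (1.0 - alpha) * net_input

def additive_update(a_old, net_input, scale=0.1):
    """Sun et al.-style update: add the old activation and the scaled
    current net input; every nonzero input perturbs the stored value."""
    return a_old + scale * net_input

a = 1.0  # value we would like to remember
for t in range(50):
    a = additive_update(a, net_input=0.2)  # ongoing inputs keep shifting it
print(a)  # 2.0: the stored value has drifted, long-term storage is impractical
```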

Ring's approach. Ring (1993) also proposed a method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, to bridge a time lag involving 100 steps may require the addition of 100 units. Also, Ring's net does not generalize to unseen lag durations.

### Research background and development of the bidirectional LSTM (BiLSTM)

The concept of the bidirectional LSTM (Bidirectional LSTM, BiLSTM) traces back to the Long Short-Term Memory network proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997, and to the sequence-modelling work developed further by Graves et al. in 2005 [^1]. Concretely, the core idea of the bidirectional LSTM is to add a reverse processing pass on top of the unidirectional LSTM, so that the model can draw on both past and future information when extracting features [^2].

#### Key references

The following works laid the foundations of the bidirectional LSTM and drove its adoption:

1. **Hochreiter & Schmidhuber (1997)**
   This classic paper first proposed the LSTM architecture and addressed the vanishing/exploding gradient problem of traditional RNNs. Although it does not describe a bidirectional architecture, it provides the theoretical basis for the later developments.

2. **Graves et al. (2005)**
   Alex Graves and his team published "Framewise phoneme classification with bidirectional networks", which brought bidirectional LSTM networks to speech recognition. They showed that combining forward and backward passes yields richer context and significantly improves classification performance [^3].

3. **Schuster & Paliwal (1997)**
   Michael Schuster and Kuldip K. Paliwal introduced the bidirectional recurrent neural network (BRNN) and discussed its use in pattern-recognition tasks. Although their implementation differs from the modern BiLSTM, it is regarded as an important precursor of the idea [^4].

4. **Sutskever et al. (2014)**
   Ilya Sutskever and colleagues demonstrated the power of deep LSTM encoder-decoder frameworks for machine translation, in which the encoder captures an overall semantic representation of the source sentence; bidirectional encoders later became a common choice in such architectures [^5].

Together, these works led to the standard form in use today: two conventional LSTMs operating independently, each with its own parameters, one processing the sequence in normal temporal order (forward pass) and the other in the reverse direction (backward pass), with their outputs combined into a single representation [^6].

In addition, with deep-learning libraries such as TensorFlow and PyTorch now widely available, developers can build their own customized BiLSTM models simply by setting the corresponding hyperparameters [^7]:

```python
import paddle.nn as nn

class BiLSTMModel(nn.Layer):
    def __init__(self, input_size, hidden_size, num_layers=1):
        super(BiLSTMModel, self).__init__()
        # direction='bidirectional' runs a forward and a backward LSTM
        # and concatenates their hidden states at every time step.
        self.lstm = nn.LSTM(input_size=input_size,
                            hidden_size=hidden_size,
                            direction='bidirectional',
                            num_layers=num_layers)

    def forward(self, inputs):
        # inputs:  [batch_size, seq_len, input_size]
        # outputs: [batch_size, seq_len, 2 * hidden_size]
        outputs, _ = self.lstm(inputs)
        return outputs
```

The snippet above defines a basic bidirectional LSTM network class on the PaddlePaddle platform [^8].
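As a quick usage check (not part of the original post), the hypothetical snippet below instantiates the class above and passes a random batch through it, assuming PaddlePaddle is installed; the sizes are arbitrary and only illustrate that the bidirectional output carries 2 * hidden_size features per time step.

```python
import paddle

model = BiLSTMModel(input_size=16, hidden_size=32)
x = paddle.randn([4, 10, 16])   # [batch_size, seq_len, input_size]
y = model(x)
print(y.shape)                  # [4, 10, 64]: 2 * hidden_size features per step
```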