Four Representative Extensions of RNNs: Attention and Augmented Recurrent Neural Networks (Part 2)

This post covers the last two of the four RNN extensions. It continues from Attention and Augmented Recurrent Neural Networks (Part 1).

Adaptive Computation Time

Standard RNNs do the same amount of computation each time step. This seems unintuitive. Surely, one should think more when things are hard? It also limits RNNs to doing O(n) operations for a list of length n.
Adaptive Computation Time (Graves, 2016) is a way for RNNs to do different amounts of computation each step. The big-picture idea is simple: allow the RNN to do multiple steps of computation for each time step.
In order for the network to learn how many steps to do, we want the number of steps to be differentiable. We achieve this with the same trick we used before: instead of deciding to run for a discrete number of steps, we have an attention distribution over the number of steps to run. The output is a weighted combination of the outputs of each step.
There are a few more details, which were left out in the previous diagram. Here’s a complete diagram of a time step with three computation steps:
[Figure: one time step with three computation steps]
That’s a bit complicated, so let’s work through it step by step. At a high level, we’re still running the RNN and outputting a weighted combination of the states.
The weight for each step is determined by a “halting neuron”. It’s a sigmoid neuron that looks at the RNN state and gives a halting weight, which we can think of as the probability that we should stop at that step.
We have a total budget for the halting weights of 1, so we track that budget along the top. When it gets to less than epsilon, we stop.
When we stop, we might have some leftover halting budget, because we stop when the budget gets to less than epsilon. What should we do with it? Technically, it’s being given to future steps, but we don’t want to compute those, so we attribute it to the last step.
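The halting procedure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation from the paper: the function name `act_step`, the list-based state vectors, and the cap on the number of steps (the length of `halt_logits`) are all hypothetical choices made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def act_step(states, halt_logits, epsilon=0.01):
    """Combine the intermediate states of one time step, ACT-style.

    states:      list of state vectors (lists of floats), one per computation step.
    halt_logits: pre-sigmoid outputs of the halting neuron, one per step.
    Returns (combined_state, weights).
    """
    budget = 1.0  # total halting budget for this time step
    weights = []
    for n, logit in enumerate(halt_logits):
        h = sigmoid(logit)  # halting weight proposed for this step
        is_last = n == len(halt_logits) - 1  # assumed cap on the number of steps
        if budget - h < epsilon or is_last:
            # Stop here: the leftover budget is attributed to the last step.
            weights.append(budget)
            break
        weights.append(h)
        budget -= h
    dim = len(states[0])
    # zip() only pairs the steps that were actually run with their weights.
    combined = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
    return combined, weights


# Three candidate steps, each with halting weight sigmoid(0) = 0.5:
# the budget is used up after two steps, so the third state is ignored.
combined, weights = act_step(
    states=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    halt_logits=[0.0, 0.0, 0.0],
)
print(weights)   # [0.5, 0.5]
print(combined)  # [0.5, 0.5]
```

Because the weights always sum to 1, the output is a proper weighted average of the states that were computed.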
When training Adaptive Computation Time models, one adds a “ponder cost” term to the cost function. This penalizes the model for the amount of computation it uses. The bigger you make this term, the more the model will trade off performance for lower compute time.
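As a rough sketch of how such a penalty could enter the loss: in Graves's paper, the ponder cost of a time step is the number of computation steps used plus the leftover budget given to the last step, and it is scaled by a penalty coefficient. The function names and the default value of `tau` below are assumptions for illustration only.

```python
def ponder_cost(weights):
    """Ponder cost of one time step: steps used plus the remainder.

    weights: halting weights of the steps that were run; the last entry
    is the leftover budget attributed to the final step.
    """
    return len(weights) + weights[-1]


def total_loss(task_loss, per_step_weights, tau=0.01):
    """Task loss plus tau times the ponder cost summed over all time steps.

    tau is a hypothetical name for the penalty coefficient: a larger value
    pushes the model toward fewer computation steps.
    """
    return task_loss + tau * sum(ponder_cost(w) for w in per_step_weights)


# Two time steps: the first used two computation steps, the second used one.
print(total_loss(1.0, [[0.5, 0.5], [1.0]], tau=0.1))
```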
Adaptive Computation Time is a very new idea, but we believe that it, along with similar ideas, will be very important.
The only open source implementation of Adaptive Computation Time at the moment seems to be Mark Neumann’s (TensorFlow).

Neural Programmer

Neural nets are excellent at many tasks, but they also struggle to do some basic things like arithmetic, which are trivial in normal approaches to computing. It would be really nice to have a way to fuse neural nets with normal programming, and get the best of both worlds.
The Neural Programmer (Neelakantan, et al., 2015) is one approach to this. It learns to create programs in order to solve a task. In fact, it learns to generate such programs without needing examples of correct programs. It discovers how to produce programs as a means to the end of accomplishing some task.
The actual model in the paper answers questions about tables by generating SQL-like programs to query the table. However, there are a number of details here that make it a bit complicated, so let’s start by imagining a slightly simpler model, which is given an arithmetic expression and generates a program to evaluate it.
The generated program is a sequence of operations. Each operation is defined to operate on the output of past operations. So an operation might be something like “add the output of the operation 2 steps ago and the output of the operation 1 step ago.” It’s more like a unix pipe than a program with variables being assigned to and read from.
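To make the pipe-like structure concrete, here is a minimal sketch of an executor for such a program. It assumes, purely for illustration, that the numbers from the expression seed the history of outputs and that every operation reads the two most recent outputs; `OPS` and `run_program` are hypothetical names.

```python
import operator

# A hypothetical operation set; each operation takes the outputs from
# two steps ago and one step ago.
OPS = {
    "add": operator.add,
    "multiply": operator.mul,
    "subtract": operator.sub,
}


def run_program(inputs, program):
    """Run a list of operation names over a growing history of outputs."""
    history = list(inputs)
    for name in program:
        result = OPS[name](history[-2], history[-1])
        history.append(result)  # later operations can only see past outputs
    return history[-1]


# add(2, 3) = 5, then multiply(3, 5) = 15
print(run_program([2.0, 3.0], ["add", "multiply"]))  # 15.0
```

There are no named variables anywhere: each operation is wired only to positions in the history, which is what makes the program resemble a pipe.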
The program is generated one operation at a time by a controller RNN. At each step, the controller RNN outputs a probability distribution for what the next operation should be. For example, we might be pretty sure we want to perform addition at the first time step, then have a hard time deciding whether we should multiply or divide at the second step, and so on…
The resulting distribution over operations can now be evaluated. Instead of running a single operation at each step, we do the usual attention trick of running all of them and then averaging the outputs together, weighted by the probability we ran that operation.
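A single soft step of this kind might look like the sketch below. The three-operation set and the names `SOFT_OPS` and `soft_step` are assumptions for illustration; in the real model the logits would come from the controller RNN.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical operation set: add, multiply, subtract.
SOFT_OPS = [
    lambda a, b: a + b,
    lambda a, b: a * b,
    lambda a, b: a - b,
]


def soft_step(a, b, logits):
    """Run every operation and average the outputs by their probabilities."""
    probs = softmax(logits)
    outputs = [op(a, b) for op in SOFT_OPS]
    return sum(p * o for p, o in zip(probs, outputs))


# When the controller is undecided, the result is a blend of all outputs;
# when it is confident in "add", the result is close to 2 + 3.
print(soft_step(2.0, 3.0, [0.0, 0.0, 0.0]))
print(soft_step(2.0, 3.0, [10.0, -10.0, -10.0]))
```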
As long as we can define derivatives through the operations, the program’s output is differentiable with respect to the probabilities. We can then define a loss, and train the neural net to produce programs that give the correct answer. In this way, the Neural Programmer learns to produce programs without examples of good programs. The only supervision is the answer the program should produce.
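The toy example below illustrates this training signal for a one-step program. It is a sketch under strong simplifying assumptions: the logits are free parameters rather than the output of a controller RNN, the loss is a squared error, and the gradient is written out by hand instead of using an automatic differentiation library. Only the target answer is supplied, never the name of the correct operation.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_operation_choice(a, b, target, steps=2000, lr=0.1):
    """Learn a distribution over [add, multiply, subtract] from the answer alone."""
    ops = [lambda x, y: x + y, lambda x, y: x * y, lambda x, y: x - y]
    outputs = [op(a, b) for op in ops]
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        out = sum(p * o for p, o in zip(probs, outputs))  # soft program output
        err = out - target
        # loss = err**2; by the chain rule through the softmax,
        # d(out)/d(logit_j) = p_j * (o_j - out)
        grads = [2.0 * err * p * (o - out) for p, o in zip(probs, outputs)]
        logits = [z - lr * g for z, g in zip(logits, grads)]
    return softmax(logits)


# The answer 6 for inputs (2, 3) is only consistent with multiplication,
# so most of the probability mass should end up on the second operation.
probs = train_operation_choice(2.0, 3.0, target=6.0)
print(probs)
```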
That’s the core idea of Neural Programmer, but the version in the paper answers questions about tables, rather than arithmetic expressions. There are a few additional neat tricks:
1. Multiple Types: Many of the operations in the Neural Programmer deal with types other than scalar numbers. Some operations output selections of table columns or selections of cells. Only outputs of the same type get merged together.
2. Referencing Inputs: The Neural Programmer needs to answer questions like “How many cities have a population greater than 1,000,000?” given a table of cities with a population column. To facilitate this, some operations allow the network to reference constants in the question they’re answering, or the names of columns. This referencing happens by attention, in the style of pointer networks (Vinyals, et al., 2015).
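Referencing by attention can be pictured as producing a probability distribution over the available inputs. The sketch below uses dot-product scores between a query vector and column-name embeddings, which is a simplification chosen for brevity; the embeddings, the query, and the name `point_to_column` are all made up for this example.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def point_to_column(query, column_embeddings):
    """Return an attention distribution over column names.

    query:             vector summarizing what the controller is looking for.
    column_embeddings: dict mapping a column name to its embedding vector.
    """
    names = list(column_embeddings)
    scores = [
        sum(q * k for q, k in zip(query, column_embeddings[name]))
        for name in names
    ]
    return dict(zip(names, softmax(scores)))


# Hypothetical 2-d embeddings; the query is closest to "population".
attention = point_to_column(
    query=[0.0, 3.0],
    column_embeddings={"city": [1.0, 0.0], "population": [0.0, 1.0]},
)
print(attention)
```

Since the result is a distribution rather than a hard choice, an operation that consumes it stays differentiable, just like the soft choice between operations.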

The Neural Programmer isn’t the only approach to having neural networks generate programs. Another lovely approach is the Neural Programmer-Interpreter (Reed & de Freitas, 2015) which can accomplish a number of very interesting tasks, but requires supervision in the form of correct programs.
We think that this general space, bridging the gap between more traditional programming and neural networks, is extremely important. While the Neural Programmer is clearly not the final solution, we think there are a lot of important lessons to be learned from it.
There don’t seem to be any open source implementations of the Neural Programmer at present, but there is an implementation of the Neural Programmer-Interpreter by Ken Morishita (Keras): https://github.com/mokemokechicken/keras_npi
