Single Layer
A layer has two memories: an input memory and an output memory. The parameters are $A \in \mathbb{R}^{d \times |V|}$, $B \in \mathbb{R}^{d \times |V|}$, $C \in \mathbb{R}^{d \times |V|}$, and $W \in \mathbb{R}^{|V| \times d}$.
Input set $x_1, \dots, x_i$ (one-hot encoding or distributed encoding? Probably one-hot.)
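For concreteness, here is a minimal sketch (my own, not from the paper) of encoding a sentence as a bag of one-hot word vectors in $\mathbb{R}^{|V|}$; the vocabulary and sentence are made up for illustration:

```python
import numpy as np

# Toy vocabulary; |V| = 6. Purely illustrative.
vocab = {"mary": 0, "went": 1, "to": 2, "the": 3, "kitchen": 4, "garden": 5}
V = len(vocab)

def encode(sentence):
    """Bag-of-words encoding: sum of one-hot word vectors over the vocabulary."""
    x = np.zeros(V)
    for word in sentence.split():
        x[vocab[word]] += 1.0
    return x

x1 = encode("mary went to the kitchen")   # vector in R^{|V|}
```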
Input memory
The input memory is represented by vectors $m_i$, where each $m_i$ is computed as $m_i = A x_i$, i.e. each $x_i$ is embedded into memory using $A$.
Query
The query $q$ is also embedded, via $B$, into an internal state $u = B q$. The match between $u$ and each memory $m_i$ is scored by $p_i = \text{softmax}(u^{\top} m_i)$, so $p$ is a probability vector over the inputs.
Output memory
Each $x_i$ also has a corresponding output vector $c_i = C x_i$. The response vector from the output memory is the weighted sum $o = \sum_i p_i c_i$.
Generating final prediction
The predicted label is $\hat{a} = \text{softmax}(W(o + u))$.
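Putting the single-layer equations together, here is a minimal NumPy sketch (my own, assuming bag-of-words inputs $x_i$ and query $q$ in $\mathbb{R}^{|V|}$; the dimensions and random parameters are illustrative, not the paper's code):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d, V, n = 20, 100, 5                  # embedding dim, vocab size, number of memories
rng = np.random.default_rng(0)
A = rng.normal(size=(d, V))           # input embedding
B = rng.normal(size=(d, V))           # query embedding
C = rng.normal(size=(d, V))           # output embedding
W = rng.normal(size=(V, d))           # answer prediction matrix

X = rng.integers(0, 2, size=(n, V)).astype(float)   # rows are x_1 ... x_n
q = rng.integers(0, 2, size=V).astype(float)         # query

m = X @ A.T                           # input memories  m_i = A x_i
c = X @ C.T                           # output memories c_i = C x_i
u = B @ q                             # internal state  u = B q
p = softmax(m @ u)                    # p_i = softmax(u^T m_i)
o = p @ c                             # o = sum_i p_i c_i
a_hat = softmax(W @ (o + u))          # predicted answer distribution over |V|
```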
Multiple Layers
With $K$ hop operations, the memory layers are stacked in the following way:
The input to layers above the first is the sum of the output $o^k$ and the input $u^k$ from layer $k$ (different ways to combine $o^k$ and $u^k$ are proposed later): $u^{k+1} = u^k + o^k$. Each layer has its own embedding matrices $A^k$, $C^k$, used to embed the inputs $x_i$.
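A minimal sketch of the $K$-hop stacking (continuing the shapes and helpers from the single-layer sketch above; names like `A_list` and `C_list` are mine, not the paper's):

```python
K = 3
A_list = [rng.normal(size=(d, V)) for _ in range(K)]   # A^1 ... A^K
C_list = [rng.normal(size=(d, V)) for _ in range(K)]   # C^1 ... C^K

u = B @ q                              # u^1 from the query
for k in range(K):
    m = X @ A_list[k].T                # memories for hop k
    c = X @ C_list[k].T
    p = softmax(m @ u)
    o = p @ c
    u = u + o                          # u^{k+1} = u^k + o^k

a_hat = softmax(W @ u)                 # a^ = softmax(W u^{K+1}), see the next paragraph
```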
At the top of the network, the prediction matrix $W$ also combines the input and the output of the top memory layer: $\hat{a} = \text{softmax}(W u^{K+1}) = \text{softmax}(W(o^K + u^K))$. Two types of weight tying (see the sketch after this list):
- Adjacent: the output embedding for one layer is the input embedding for the one above, i.e. $A^{k+1} = C^k$. They also constrain (a) the answer prediction matrix to be the same as the final output embedding, i.e. $W^{\top} = C^K$, and (b) the question embedding to match the input embedding of the first layer, i.e. $B = A^1$.
- Layer-wise (RNN-like): the input and output embeddings are the same across different layers, i.e. $A^1 = A^2 = \dots = A^K$ and $C^1 = C^2 = \dots = C^K$. The authors found it useful to add a linear mapping $H$ to the update of $u$ between hops, i.e. $u^{k+1} = H u^k + o^k$. This mapping is learned along with the rest of the parameters and is used throughout their experiments with layer-wise weight tying.
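To make the two tying schemes concrete, here is a rough sketch (again my own, continuing the notation and helpers above; all shapes and values are illustrative):

```python
# Adjacent tying: A^{k+1} = C^k, W^T = C^K, B = A^1.
# Build K+1 embedding matrices E_0 .. E_K and slice them.
E = [rng.normal(size=(d, V)) for _ in range(K + 1)]
A_adj = E[:K]                          # A^k = E_{k-1}
C_adj = E[1:]                          # C^k = E_k, hence A^{k+1} = C^k
B_adj = A_adj[0]                       # B = A^1
W_adj = C_adj[-1].T                    # W^T = C^K

# Layer-wise (RNN-like) tying: one shared A and C, plus a learned map H
# in the hop update u^{k+1} = H u^k + o^k.
A_shared = rng.normal(size=(d, V))
C_shared = rng.normal(size=(d, V))
H = rng.normal(size=(d, d))

u = B @ q
for k in range(K):
    m = X @ A_shared.T
    c = X @ C_shared.T
    p = softmax(m @ u)
    o = p @ c
    u = H @ u + o                      # u^{k+1} = H u^k + o^k

a_hat = softmax(W @ u)
```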