Seq2Seq for LaTeX generation

Latest recommended article published on 2024-08-08 07:13:48

zonas.wang

Category: Computer Vision. Tags: OCR, formula recognition, im2latex

This post is the second in a series about im2latex. Its goal is to explain

how to use a seq2seq model for LaTeX generation
how to implement it in Tensorflow.

If you're not familiar with seq2seq, go to part one first.

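Producing LaTeX code from an image
The code is available on github. Although it was designed for image-to-LaTeX conversion (the im2latex challenge), it could be used for standard seq2seq with very few changes.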

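Introduction

As an engineering student, how many times did I ask myself

How amazing would it be if I could take a picture of my math homework and produce a nice LaTeX file out of it!

This thought has been nagging me for a long time (and I'm sure I'm not the only one), and since I started studying at Stanford I have been eager to tackle the problem myself. Apart from some work by the Harvard NLP group and this cool website, it was hard to find other solutions. I figured the problem was probably not that easy, so I chose to wait until the amazing computer vision class to tackle it.

The problem is about producing a sequence of tokens from an image, and is thus at the intersection of computer vision and natural language processing.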

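Approach

The first part introduced the concept of sequence-to-sequence applied to machine translation. The same framework can be applied to our LaTeX generation problem. The input sequence is simply replaced by an image, preprocessed with a convolutional model adapted to OCR (in a sense, if we unfold the pixels of an image into a sequence, this is exactly the same problem). This idea proved to be effective for the task of generating image captions (see the reference Show, Attend and Tell). Building on some great work from the Harvard NLP group, my teammate Romain and I chose to follow a similar approach.

Keep the seq2seq framework, but replace the encoder with a convolutional network over the image!

Good Tensorflow implementations of such models were hard to find. Together with this post, I am releasing the code, in the hope that some will find it useful. You can use it to train your own image captioning model or adapt it to a more advanced use. The code does not rely on the Tensorflow Seq2Seq library, as it was not entirely ready at the time of the project and I also wanted more flexibility (but it adopts a similar interface).

We'll assume that you're familiar with Seq2Seq as introduced in part one.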

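Data

To train our model, we need labeled examples: images of formulas along with the LaTeX code used to generate them. A good source of LaTeX code is arXiv, which hosts thousands of articles in .tex format. After applying some heuristics to find equations in the .tex files and keeping only the ones that actually compile, the Harvard NLP group extracted about 100,000 formulas.

Wait... isn't there a problem? Different LaTeX code can produce the same image.

Good point: (x^2 + 1) and \left( x^{2} + 1 \right) indeed give the same output. That's why the Harvard paper found that normalizing the data with a parser (KaTeX) improved performance. It enforces a few conventions, such as writing x ^ { 2 } instead of x^2. After normalization, they end up with a .txt file containing one formula per line, which looks like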
\alpha + \beta
\frac { 1 } { 2 }
\frac { \alpha } { \beta }
1 + 2
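From this file, we'll produce the images 0.png, 1.png, etc., along with a matching file that maps each image file to the index (= line number) of its formula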
0.png 0
1.png 1
2.png 2
3.png 3
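The reason we use this format is that it is flexible and lets you use the pre-built dataset from Harvard (you may need to run the preprocessing scripts first). You'll also need to have pdflatex and ImageMagick installed.
We also build a vocabulary that maps LaTeX tokens to the indices given as input to our model. With the same data as above, our vocabulary would be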
+ 1 2 \alpha \beta \frac { }
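As a rough illustration of the vocabulary step, the sketch below builds a token-to-index map from the four normalized formulas shown earlier. The name `build_vocab` and the choice of sorting the tokens are assumptions made for this example; the released code may order its vocabulary differently.

```python
def build_vocab(formulas):
    """Map each whitespace-separated LaTeX token to an integer id."""
    tokens = set()
    for formula in formulas:
        tokens.update(formula.split())
    # sorting only makes the ids reproducible; any fixed order would do
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}


formulas = [
    r"\alpha + \beta",
    r"\frac { 1 } { 2 }",
    r"\frac { \alpha } { \beta }",
    "1 + 2",
]
vocab = build_vocab(formulas)
# ids of the first formula, as they would be fed to the model
ids = [vocab[tok] for tok in formulas[0].split()]
```

Because normalization separates every token with a space, a plain `split()` is enough to tokenize a formula here.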

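Model

Our model relies on a variant of the Seq2Seq model adapted to images. First, let's define the inputs of the graph. Not surprisingly, we get as input a batch of black-and-white images of shape [H, W] and a batch of formulas (ids of the LaTeX tokens):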

import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.layers)

# batch of images, shape = (batch size, height, width, 1)
img = tf.placeholder(tf.uint8, shape=(None, None, None, 1), name='img')
# batch of formulas, shape = (batch size, length of the formula)
formula = tf.placeholder(tf.int32, shape=(None, None), name='formula')
# for padding
formula_length = tf.placeholder(tf.int32, shape=(None, ), name='formula_length')

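A special note on the type of the image input. You may have noticed that we use tf.uint8. This is because our images are encoded in grayscale (integers between 0 and 255, that is, $2^8=256$ values). Even though we could feed a tf.float32 tensor to Tensorflow, that would cost 4 times more in terms of memory bandwidth. As data starvation is one of the main bottlenecks of GPUs, this simple trick can save us some training time. To further improve the data pipeline, have a look at the new TensorFlow data pipeline.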

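Encoding

The high-level idea is to apply a convolutional network on top of the image and flatten the output into a sequence of vectors $[e_1, \dots, e_n]$, each of them corresponding to a region of the input image. These vectors play the role of the hidden vectors of the LSTM that we used for translation.

Once our image is turned into a sequence, we can use the seq2seq model!

Convolutional encoder - produces a sequence of vectors
We need to extract features from our image, and for this nothing has proven more effective than convolutions. There is not much to say here, except that we picked an architecture that has been shown to work well for optical character recognition (OCR): it stacks convolutional layers and max-pooling layers to produce a tensor of shape [H', W', 512].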

# casting the image back to float32 on the GPU
img = tf.cast(img, tf.float32) / 255.

out = tf.layers.conv2d(img, 64, 3, 1, "SAME", activation=tf.nn.relu)
out = tf.layers.max_pooling2d(out, 2, 2, "SAME")

out = tf.layers.conv2d(out, 128, 3, 1, "SAME", activation=tf.nn.relu)
out = tf.layers.max_pooling2d(out, 2, 2, "SAME")

out = tf.layers.conv2d(out, 256, 3, 1, "SAME", activation=tf.nn.relu)

out = tf.layers.conv2d(out, 256, 3, 1, "SAME", activation=tf.nn.relu)
out = tf.layers.max_pooling2d(out, (2, 1), (2, 1), "SAME")

out = tf.layers.conv2d(out, 512, 3, 1, "SAME", activation=tf.nn.relu)
out = tf.layers.max_pooling2d(out, (1, 2), (1, 2), "SAME")

# encoder representation, shape = (batch size, height', width', 512)
out = tf.layers.conv2d(out, 512, 3, 1, "VALID", activation=tf.nn.relu)
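To get a feeling for how long the resulting sequence is, the following sketch traces the spatial size through the stack above. It relies on two standard facts: a stride-1 "SAME" convolution keeps the size, and a "SAME" pooling with stride s maps n to ceil(n / s). The helper name `encoder_output_size` and the 50 x 120 input are assumptions for the example.

```python
import math


def encoder_output_size(height, width):
    """Spatial size (H', W') produced by the conv stack above."""
    # two 2x2 max-poolings with "SAME" padding: ceil(n / 2) each time
    for _ in range(2):
        height, width = math.ceil(height / 2), math.ceil(width / 2)
    # the (2, 1) pooling halves the height only, the (1, 2) one the width only
    height = math.ceil(height / 2)
    width = math.ceil(width / 2)
    # the last 3x3 convolution uses "VALID" padding: 2 pixels lost per axis
    return height - 2, width - 2


h_out, w_out = encoder_output_size(50, 120)
n = h_out * w_out  # length of the sequence seen by the decoder
```

Under these assumptions each image axis is reduced by a factor of about 8, so very small images can end up with an empty feature map after the final "VALID" convolution.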

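Now that we have extracted some features from the image, let's unfold the image to get a sequence so that we can use the seq2seq framework. We end up with a sequence of length H' x W'.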

# dynamic height' and width' of the feature map
shape = tf.shape(out)
H, W = shape[1], shape[2]
seq = tf.reshape(out, shape=[-1, H*W, 512])

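Aren't you losing a lot of structural information by reshaping? I'm afraid that when applying attention over the image, my decoder won't be able to understand the location of each feature vector in the original image!

It turns out that the model manages to work despite this issue, but that is not completely satisfying. In the case of translation, the hidden states of the LSTM contain some positional information computed by the LSTM (after all, an LSTM is sequential by nature). Can we fix this issue?
Positional embeddings. I decided to follow the idea from Attention is All you Need and add positional embeddings to the image representation (out), which has the huge advantage of not adding any new trainable parameter to the model. The idea is that for each position of the image, we compute a vector of size 512 whose components are cos or sin. More formally, the (2i)-th and (2i+1)-th entries of the positional embedding v at position p are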
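$v_{2i} = \sin(p / f^{2i})$, $v_{2i+1} = \cos(p / f^{2i})$, where f is some frequency parameter (both entries of a pair share the same frequency, as in Attention is All you Need).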
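Intuitively, because $\sin(a + b)$ and $\cos(a + b)$ can be expressed in terms of $\sin(a)$, $\sin(b)$, $\cos(a)$ and $\cos(b)$, there is a linear dependency between the components of distant embeddings, which allows the model to extract relative position information. Good news: the TensorFlow code for this technique is available in the tensor2tensor library, so we just need to reuse the same function and transform out with the following call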

out = add_timing_signal_nd(out)
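The linear dependency can be checked numerically. The sketch below is a 1-D simplification in plain Python, not the tensor2tensor implementation: it assumes each (sin, cos) pair shares one frequency, and the names `positional_embedding` and `shift` as well as `dim=8` and `f=10.0` are made up for the example. It shows that the embedding of position p + k is obtained from the embedding of p by a rotation that depends only on k.

```python
import math


def positional_embedding(p, dim=8, f=10.0):
    """Entries 2i and 2i+1 are the sin and cos of p / f**(2i)."""
    v = []
    for i in range(dim // 2):
        angle = p / f ** (2 * i)
        v.extend([math.sin(angle), math.cos(angle)])
    return v


def shift(v, k, f=10.0):
    """Rotate each (sin, cos) pair to get the embedding of position p + k."""
    out = []
    for i in range(len(v) // 2):
        s, c = v[2 * i], v[2 * i + 1]
        a = k / f ** (2 * i)
        # sin(x + a) = sin(x)cos(a) + cos(x)sin(a)
        # cos(x + a) = cos(x)cos(a) - sin(x)sin(a)
        out.extend([s * math.cos(a) + c * math.sin(a),
                    c * math.cos(a) - s * math.sin(a)])
    return out


v3 = positional_embedding(3)
v7 = positional_embedding(7)
# shifting position 3 by 4 should reproduce the embedding of position 7
max_err = max(abs(x - y) for x, y in zip(shift(v3, 4), v7))
```

Since `shift` is linear in `v`, a model with linear layers can in principle learn such an offset, which is the argument made in the text.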

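Decoding

Now that we have a sequence of vectors $[e_1, \dots, e_n]$ that represents our input image, let's decode it! First, let's explain the variant of the Seq2Seq framework we are going to use.
In the seq2seq framework, the first hidden vector of the decoder LSTM is usually the last hidden vector of the encoder LSTM. Here we don't have such a vector, so a good choice is to learn to compute it with a matrix W and a vector b.
$h_{0}=\tanh \left(W \cdot\left(\frac{1}{n} \sum_{i=1}^{n} e_{i}\right)+b\right)$
This can be done in Tensorflow with the following logic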

img_mean = tf.reduce_mean(seq, axis=1)
W = tf.get_variable("W", shape=[512, 512])
b = tf.get_variable("b", shape=[512])
h = tf.tanh(tf.matmul(img_mean, W) + b)

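Attention mechanism. We first need to compute a score $\alpha_{t^{\prime}}$ for each vector $e_{t^{\prime}}$ of the sequence. We use the following method
$\begin{aligned} \alpha_{t^{\prime}} &=\beta^{T} \tanh \left(W_{1} \cdot e_{t^{\prime}}+W_{2} \cdot h_{t}\right) \\ \overline{\alpha} &=\operatorname{softmax}(\alpha) \\ c_{t} &=\sum_{t^{\prime}=1}^{n} \overline{\alpha}_{t^{\prime}} e_{t^{\prime}} \end{aligned}$
This can be done in TensorFlow with the following code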

# over the image, shape = (batch size, n, 512)
W1_e = tf.layers.dense(inputs=seq, units=512, use_bias=False)
# over the hidden vector, shape = (batch size, 512)
W2_h = tf.layers.dense(inputs=h, units=512, use_bias=False)

# sums the two contributions
a = tf.tanh(W1_e + tf.expand_dims(W2_h, axis=1))
beta = tf.get_variable("beta", shape=[512, 1], dtype=tf.float32)
n = tf.shape(seq)[1]  # number of image regions (H' * W')
a_flat = tf.reshape(a, shape=[-1, 512])
a_flat = tf.matmul(a_flat, beta)
a = tf.reshape(a_flat, shape=[-1, n])

# compute weights
a = tf.nn.softmax(a)
a = tf.expand_dims(a, axis=-1)
c = tf.reduce_sum(a * seq, axis=1)

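Note that the line W1_e = tf.layers.dense(inputs=seq, units=512, use_bias=False) is the same for every decoder time step, so we can compute it once and for all. A dense layer without bias is just a matrix multiplication.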
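For readers who prefer to see the attention equations without any framework, here is a small sketch in plain Python for a single example (no batch dimension). The helper names `matvec` and `attention` and all the numbers are made up for illustration; the tiny 2-dimensional vectors stand in for the 512-dimensional ones used by the model.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention(E, h, W1, W2, beta):
    """Return the softmax weights and the context vector for encoder vectors E."""
    w2h = matvec(W2, h)  # does not depend on the image region
    scores = []
    for e in E:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, e), w2h)]
        scores.append(sum(bj * hj for bj, hj in zip(beta, hidden)))
    # numerically stable softmax over the n regions
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # context vector: weighted sum of the encoder vectors
    context = [sum(w * e[j] for w, e in zip(weights, E)) for j in range(len(E[0]))]
    return weights, context


E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three image regions
h = [0.5, -0.5]                           # decoder hidden state
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [0.0, 0.5]]
beta = [1.0, -1.0]
weights, context = attention(E, h, W1, W2, beta)
```

In this toy setup the first region receives the largest weight, and the weights always sum to one because of the softmax.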

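Now that we have our attention vector, let's add a small modification and compute another vector $o_{t-1}$ (as in Luong, Pham and Manning) that we will use to make the final prediction and feed as input to the LSTM at the next step. Here $w_{t-1}$ denotes the embedding of the token generated at the previous step.

$o_{t-1}$ passes along some information about the distribution from the previous time step, as well as the confidence the model had in the predicted token.

$\begin{aligned} h_{t} &=\operatorname{LSTM}\left(h_{t-1},\left[w_{t-1}, o_{t-1}\right]\right) \\ c_{t} &=\operatorname{Attention}\left(\left[e_{1}, \ldots, e_{n}\right], h_{t}\right) \\ o_{t} &=\tanh \left(W_{3} \cdot\left[h_{t}, c_{t}\right]\right) \\ p_{t} &=\operatorname{softmax}\left(W_{4} \cdot o_{t}\right) \end{aligned}$
Now the code: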

# compute o
W3_o = tf.layers.dense(inputs=tf.concat([h, c], axis=-1), units=512, use_bias=False)
o = tf.tanh(W3_o)

# compute the logits scores (before softmax)
logits = tf.layers.dense(inputs=o, units=vocab_size, use_bias=False)
# the softmax will be computed in the loss or somewhere else

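If I read carefully, I notice that for the first step of the decoding process we also need to compute an $o_{0}$, right?

That's a good point. We simply use the same technique as for $h_{0}$, but with different weights.

Training

We need to create two different outputs in the TensorFlow graph: one for training (which uses the formula and feeds the ground truth at each time step, see part I) and one for test time (which ignores everything about the actual formula and uses the prediction from the previous step).

AttentionCell

We need to encapsulate the recurrent logic into a custom cell that inherits from RNNCell. Our custom cell will be able to call the LSTM cell (initialized in __init__). It also has a special recurrent state that combines the LSTM state and the vector $o$ (which we need to pass along). An elegant way to do this is to define a namedtuple for this recurrent state: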

AttentionState = collections.namedtuple("AttentionState", ("lstm_state", "o"))

class AttentionCell(RNNCell):
    def __init__(self):
        self.lstm_cell = LSTMCell(512)

    def __call__(self, inputs, cell_state):
        """
        Args:
            inputs: shape = (batch_size, dim_embeddings) embeddings from previous time step
            cell_state: (AttentionState) state from previous time step
        """
        lstm_state, o = cell_state
        # compute h
        h, new_lstm_state = self.lstm_cell(tf.concat([inputs, o], axis=-1), lstm_state)
        # apply previous logic
        c = ...
        new_o  = ...
        logits = ...

        new_state = AttentionState(new_lstm_state, new_o)
        return logits, new_state

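Then, to compute our output sequence, we just need to call the previous cell on the sequence of LaTeX tokens. We first produce the sequence of token embeddings, to which we prepend the special <sos> token. Then, we call dynamic_rnn.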

# 1. get token embeddings
E = tf.get_variable("E", shape=[vocab_size, 80], dtype=tf.float32)
# special <sos> token
start_token = tf.get_variable("start_token", dtype=tf.float32, shape=[80])
tok_embeddings = tf.nn.embedding_lookup(E, formula)

# 2. add the special <sos> token embedding at the beginning of every formula
batch_size = tf.shape(formula)[0]
# 80 is the embedding dimension used for E and start_token above
start_token_ = tf.reshape(start_token, [1, 1, 80])
start_tokens = tf.tile(start_token_, multiples=[batch_size, 1, 1])
# remove the <eos> that won't be used because we reached the end
tok_embeddings = tf.concat([start_tokens, tok_embeddings[:, :-1, :]], axis=1)

# 3. decode
attn_cell = AttentionCell()
seq_logits, _ = tf.nn.dynamic_rnn(attn_cell, tok_embeddings, initial_state=AttentionState(h_0, o_0))

Loss

Everything is in the code:

# compute - log(p_i[y_i]) for each time step, shape = (batch_size, formula length)
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=seq_logits, labels=formula)
# masking the losses
mask = tf.sequence_mask(formula_length)
losses = tf.boolean_mask(losses, mask)
# averaging the loss over the batch
loss = tf.reduce_mean(losses)
# building the train op
optimizer = tf.train.AdamOptimizer(learning_rate)
train_op = optimizer.minimize(loss)

When iterating over the batches during training, train_op is given to the tf.Session along with a feed_dict containing the data for the placeholders.
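The effect of the masking is easy to reproduce without TensorFlow. The sketch below is a plain-Python analogue of the loss, under the assumption that padded time steps should simply be ignored; `masked_cross_entropy` and the toy batch are invented for the example.

```python
import math


def masked_cross_entropy(logits, labels, lengths):
    """Average -log p[y] over the real (non-padded) time steps of a batch."""
    losses = []
    for seq_logits, seq_labels, length in zip(logits, labels, lengths):
        for t in range(length):  # steps >= length are padding and are skipped
            step = seq_logits[t]
            m = max(step)
            # log of the softmax normalizer, computed in a stable way
            log_z = m + math.log(sum(math.exp(x - m) for x in step))
            losses.append(log_z - step[seq_labels[t]])
    return sum(losses) / len(losses)


# batch of 2 formulas, maximum length 3, vocabulary of 2 tokens
logits = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [5.0, -5.0], [9.0, 9.0]],
]
labels = [[0, 1, 1], [1, 0, 0]]
lengths = [3, 1]  # the second formula has a single real token
loss = masked_cross_entropy(logits, labels, lengths)
```

All four unmasked steps have uniform logits, so the loss equals log 2; the padded steps of the second formula do not influence it, whatever their logits are.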

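Decoding in TensorFlow

Before moving on to beam search, let's look at the Tensorflow implementation of the greedy search method

Greedy Search

While greedy decoding is easy to conceptualize, implementing it in TensorFlow is not straightforward, as you need to use the previous prediction and cannot use dynamic_rnn on the formula. There are basically two ways of approaching the problem

One option is to modify our AttentionCell and AttentionState so that AttentionState also contains the embedding of the word predicted at the previous time step.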

AttentionState = collections.namedtuple("AttentionState", ("lstm_state", "o", "embedding"))

class AttentionCell(RNNCell):
    def __call__(self, inputs, cell_state):
        lstm_state, o, embedding = cell_state
        # compute h
        h, new_lstm_state = self.lstm_cell(tf.concat([embedding, o], axis=-1), lstm_state)
        # usual logic
        new_o = ...
        logits = ...
        # compute new embedding
        new_ids = tf.cast(tf.argmax(logits, axis=-1), tf.int32)
        new_embedding = tf.nn.embedding_lookup(self._embeddings, new_ids)
        new_state = AttentionState(new_lstm_state, new_o, new_embedding)

        return logits, new_state

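This technique has a few downsides. It does not use inputs (which used to be the embedding of the gold token from the formula), so we would have to call dynamic_rnn on a "fake" sequence. Also, how do you know when to stop decoding once you have reached the <eos> token?

The other option is to implement a variant of dynamic_rnn that does not run over a sequence but feeds the prediction from the previous time step to the cell, with a maximum number of decoding steps. This involves digging deeper into TensorFlow, using tf.while_loop. That's the method we are going to use, as it fixes all the problems of the first technique. What we eventually want is something like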

attn_cell = AttentionCell(...)
# wrap the attention cell for decoding
decoder_cell = GreedyDecoderCell(attn_cell)
# call a special dynamic_decode primitive
test_outputs, _ = dynamic_decode(decoder_cell, max_length_formula+1)

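Much better, isn't it? Now let's see what GreedyDecoderCell and dynamic_decode look like.

Greedy Decoder Cell

We first wrap the attention cell in a GreedyDecoderCell that takes care of the greedy logic for us, without having to modify AttentionCell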

class DecoderOutput(collections.namedtuple("DecoderOutput", ("logits", "ids"))):
    pass

class GreedyDecoderCell(object):
    def step(self, time, state, embedding, finished):
        # next step of attention cell
        logits, new_state = self._attention_cell.step(embedding, state)
        # get ids of words predicted and get embedding
        new_ids = tf.cast(tf.argmax(logits, axis=-1), tf.int32)
        new_embedding = tf.nn.embedding_lookup(self._embeddings, new_ids)
        # create new state of decoder
        new_output = DecoderOutput(logits, new_ids)
        new_finished = tf.logical_or(finished, tf.equal(new_ids,
                self._end_token))

        return (new_output, new_state, new_embedding, new_finished)

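Dynamic Decode primitive

We need to implement a function dynamic_decode that recursively calls the step function above. We do this with a tf.while_loop that stops when all the hypotheses have reached <eos> or when time exceeds the maximum number of iterations.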

def dynamic_decode(decoder_cell, maximum_iterations):
    # initialize variables (details on github)

    def condition(time, unused_outputs_ta, unused_state, unused_inputs, finished):
        return tf.logical_not(tf.reduce_all(finished))

    def body(time, outputs_ta, state, inputs, finished):
        new_output, new_state, new_inputs, new_finished = decoder_cell.step(
            time, state, inputs, finished)
        # store the outputs in TensorArrays (details on github)
        new_finished = tf.logical_or(tf.greater_equal(time, maximum_iterations), new_finished)

        return (time + 1, outputs_ta, new_state, new_inputs, new_finished)

    with tf.variable_scope("rnn"):
        res = tf.while_loop(
            condition,
            body,
            loop_vars=[initial_time, initial_outputs_ta, initial_state, initial_inputs, initial_finished])

    # return the final outputs (details on github)

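Some details involving TensorArrays or nest.map_structure have been omitted for clarity, but can be found on github.

Notice that we place the tf.while_loop inside a scope named rnn. This is because dynamic_rnn does the same thing, and thus the weights of our LSTM are defined in that scope.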
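Stripped of the TensorFlow machinery, the control flow of greedy decoding fits in a few lines. The sketch below handles a single sequence rather than a batch, and `toy_step` is a hypothetical stand-in for the attention cell, so it only illustrates the loop: feed the previous prediction back in, and stop at <eos> or after a maximum number of steps.

```python
def greedy_decode(step_fn, initial_state, start_id, end_id, max_steps):
    """Feed each prediction back as the next input until <eos> or max_steps."""
    state, token, output = initial_state, start_id, []
    for _ in range(max_steps):
        scores, state = step_fn(token, state)
        # greedy choice: index of the highest score
        token = max(range(len(scores)), key=scores.__getitem__)
        output.append(token)
        if token == end_id:
            break
    return output


def toy_step(token, state):
    """Hypothetical cell: counts upwards, then keeps emitting <eos> (id 4)."""
    vocab_size, end_id = 5, 4
    nxt = min(state + 1, end_id)
    scores = [1.0 if i == nxt else 0.0 for i in range(vocab_size)]
    return scores, nxt


decoded = greedy_decode(toy_step, initial_state=0, start_id=0, end_id=4, max_steps=10)
truncated = greedy_decode(toy_step, initial_state=0, start_id=0, end_id=4, max_steps=2)
```

The batched TensorFlow version has to keep a `finished` flag per example instead of a `break`, because all the sequences of a batch advance together inside the same loop.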

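Beam Search Decoder Cell

We can follow the same approach as in the greedy method and use dynamic_decode

Let's create a new wrapper for AttentionCell, in the same way we did for GreedyDecoderCell. This time, the code is going to be more complicated, and the following is just for intuition. Note that when selecting the top $k$ hypotheses from the set of candidates, we must know which "beginning" (= parent hypothesis) each of them used.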

class BeamSearchDecoderCell(object):

    # notice the same arguments as for GreedyDecoderCell
    def step(self, time, state, embedding, finished):
        # compute new logits
        logits, new_cell_state = self._attention_cell.step(embedding, state.cell_state)

        # compute log probs of the step (log p(w) for all words w)
        # shape = [batch_size, beam_size, vocab_size]
        step_log_probs = tf.nn.log_softmax(logits)

        # compute scores for the (beam_size * vocabulary_size) new hypotheses
        log_probs = state.log_probs + step_log_probs

        # get top k hypotheses
        new_probs, indices = tf.nn.top_k(log_probs, self._beam_size)

        # get ids of next token along with the parent hypothesis
        new_ids = ...
        new_parents = ...

        # compute new embeddings, new_finished, new_cell state...
        new_embedding = tf.nn.embedding_lookup(self._embeddings, new_ids)

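See github for the details. The main idea is that we add a beam dimension to every tensor, but when feeding it to AttentionCell we merge the beam dimension with the batch dimension. Computing the parent ids and the new ids also involves a bit of trickery with modulo operations.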
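One common way to recover the parent and the token from a flattened top-k is integer division and modulo, sketched below in plain Python for a single example. This is an illustration under that assumption, not the code from the repository; `beam_step` and the probabilities are invented.

```python
import math


def beam_step(beam_log_probs, step_log_probs, beam_size):
    """Pick the best beam_size continuations among beam_size * vocab_size."""
    vocab_size = len(step_log_probs[0])
    # flatten the candidates: index = parent * vocab_size + token
    flat = [bp + sp
            for bp, row in zip(beam_log_probs, step_log_probs)
            for sp in row]
    top = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)[:beam_size]
    parents = [i // vocab_size for i in top]  # which hypothesis is extended
    tokens = [i % vocab_size for i in top]    # which token extends it
    scores = [flat[i] for i in top]
    return parents, tokens, scores


# two hypotheses in the beam, vocabulary of three tokens
beam_log_probs = [math.log(0.6), math.log(0.4)]
step_log_probs = [
    [math.log(0.5), math.log(0.4), math.log(0.1)],
    [math.log(0.1), math.log(0.1), math.log(0.8)],
]
parents, tokens, scores = beam_step(beam_log_probs, step_log_probs, beam_size=2)
```

Here the best candidate extends the second hypothesis with token 2 (probability 0.4 x 0.8 = 0.32), ahead of the first hypothesis extended with token 0 (0.6 x 0.5 = 0.30), which shows why the parent of each survivor has to be tracked.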

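Conclusion

I hope you learned something from this post, either about the technique or about Tensorflow. While the model achieves impressive performance (at least on short formulas, with roughly 85% of the LaTeX being reconstructed), it still raises some questions that I list here:

How do we evaluate the performance of our model? We can use standard metrics from machine translation, such as BLEU, to evaluate how well the decoded LaTeX compares to the reference. We can also choose to compile the predicted LaTeX sequence to get an image of the formula, and then compare this image to the original. As a formula is a sequence, computing a pixel-wise distance does not really make sense. The Harvard paper proposes a good idea: first, slice the image vertically; then, compare the edit distance between these slices…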
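The edit distance mentioned here is the classic Levenshtein distance, which works on any sequence of comparable items. The sketch below applies it to LaTeX tokens for simplicity; applying it to vertical image slices would require first deciding when two slices count as equal, which is not covered here. The function name and the two formulas are only an example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (tokens or image slices)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


reference = r"\frac { 1 } { 2 }".split()
predicted = r"\frac { 1 } { 3 }".split()
dist = edit_distance(reference, predicted)
```

A single wrong token gives a distance of 1, whereas an exact-match metric would count the whole formula as wrong.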

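How do we fix exposure bias? While beam search generally achieves better results, it is not perfect and still suffers from exposure bias: during training, the model is never exposed to its own errors! It also suffers from a loss-evaluation mismatch: the model is optimized w.r.t. token-level cross-entropy, while we are interested in the reconstruction of the whole sentence…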

$\frac{d}{d s}\left.\frac{1}{\Gamma(-s)}\right|_{s=0}=-1, \quad \frac{d}{d s} \frac{1}{\Gamma(-s)}_{s=0}=-1$