How the Attention-ocr-Chinese-Version-master code runs

陈壮实的搬砖生活

Last modified: 2022-06-14 10:38:13

Category: OCR paper reading. Tags: python, deep learning, pytorch, OCR

First published: 2022-06-07 23:30:24

Article link: https://blog.csdn.net/qq_41915623/article/details/125175534

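This article walks through how deep learning is applied to image recognition and text prediction: InceptionV3 extracts features, position encoding injects spatial information, and an RNN decoder with attention generates the text. The steps cover data preprocessing, multi-view convolution, position encoding, attention-based RNN decoding, and character prediction.

(Summary generated automatically by CSDN.)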

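1. Execution flow

train.py

Step1: prepare_training_dir prepares the directory where the training parameters are stored.

Step2: common_flags.create_dataset loads the dataset.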

2. How the data flows

Data read from the tfrecord files:

images: [batch_size, height, width, channels]

labels_one_hot: [batch_size, seq_length, num_char_classes], e.g. [32, 37, 5642]

Step1: Split the image along the width into four parts. Input: [batch_size, H, W, channels]. Output: 4 tensors of shape [batch_size, H, W/4, channels].

# Split into views along the width (axis=2). For example, for a 20*30*40 tensor,
# tf.split(my_tensor, 2, 0) returns two smaller 10*30*40 tensors.
views = tf.split(
    value=images, num_or_size_splits=self._params.num_views, axis=2)
logging.debug('Views=%d single view: %s', len(views), views[0])

In the original code every image contains 4 views laid out side by side, so the image is split along the width into equal parts, one per view.
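
The split in Step1 can be sketched without TensorFlow. The following is only an illustration on a single image stored as nested lists; `split_views` is a hypothetical helper, not a function from the repository.

```python
def split_views(image, num_views):
    """Split one image (a list of rows) into equal-width views, left to right."""
    width = len(image[0])
    if width % num_views != 0:
        raise ValueError("width must be divisible by num_views")
    view_width = width // num_views
    return [
        [row[i * view_width:(i + 1) * view_width] for row in image]
        for i in range(num_views)
    ]


# A toy 2 x 8 "image" holding four 2 x 2 views side by side.
image = [
    [0, 0, 1, 1, 2, 2, 3, 3],
    [0, 0, 1, 1, 2, 2, 3, 3],
]
views = split_views(image, num_views=4)
print(len(views))  # 4
print(views[2])    # [[2, 2], [2, 2]]
```

Each view keeps the full height and receives W/4 columns, which mirrors the shapes listed for Step1.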

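Step2: Feed each view through InceptionV3 and then combine the results. Input: 4 tensors of shape [batch_size, H, W/4, channels]. Returns: [batch_size, H, W, N], where N is the number of features.

nets = [
    # conv_tower_fn: runs the InceptionV3 convolutions and returns
    # [batch_size, OH, OW, N], where N is the number of features.
    self.conv_tower_fn(v, is_training, reuse=(i != 0))
    for i, v in enumerate(views)
]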

This operation corresponds to the part marked with the red box in the figure.
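
Because `reuse=(i != 0)` creates the tower variables for the first view and reuses them for the remaining ones, every view goes through the same weights. A rough sketch of that idea follows. `toy_tower` is a made-up stand-in for the convolutional tower, and joining the outputs along the width is an assumption based on the shapes listed for Step2.

```python
def toy_tower(view):
    """Hypothetical stand-in for the conv tower: maps each pixel to 2 features."""
    return [[[pixel, pixel * 10] for pixel in row] for row in view]


def apply_shared_tower(views, tower):
    """Run the same tower on every view and join the outputs along the width."""
    feature_maps = [tower(view) for view in views]
    height = len(feature_maps[0])
    combined = []
    for r in range(height):
        row = []
        for fmap in feature_maps:
            row.extend(fmap[r])  # append this view's columns to the row
        combined.append(row)
    return combined


views = [[[1, 2]], [[3, 4]]]  # two views, each of size 1 x 2
features = apply_shared_tower(views, toy_tower)
print(features)  # [[[1, 10], [2, 20], [3, 30], [4, 40]]]
```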

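Step3: Position encoding. Input: [batch_size, H, W, N]. Returns: [batch_size, H, W, N+H+W].

Position encoding is the core of this paper. I spent a lot of time trying to understand it, and once I did, it turned out to be very simple. A schematic:

The corresponding diagram in the paper:

The encoding works as follows:

(1) The position (height, width) of every pixel is encoded.

(2) The encoding is one-hot. For example, if the height has two positions, its one-hot code takes two bits; if the width has three positions, its code takes three bits.

(3) Take pixel 1 in the figure. Its position is (0, 0), so its height code is 1,0 and its width code is 1,0,0. Its position code is therefore 10100, which is appended to the end of its feature vector. Likewise, pixel 14 is at position (1, 2), so its height code is 01 and its width code is 001, and 01001 is appended to its feature vector.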
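
The rule above can be checked with a few lines of plain Python. This is a dependency-free sketch on one feature map with H=2 and W=3; the helper names are hypothetical, and the height code is placed before the width code, as in the example in (3).

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]


def encode_position(row, col, height, width):
    """Height one-hot followed by width one-hot."""
    return one_hot(row, height) + one_hot(col, width)


def append_positions(feature_map):
    """Append the position code to the feature vector of every location."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [feature_map[r][c] + encode_position(r, c, height, width)
         for c in range(width)]
        for r in range(height)
    ]


feature_map = [
    [[1, 2, 3, 4], [4, 5, 6, 7], [8, 9, 10, 11]],
    [[7, 8, 9, 10], [10, 11, 12, 13], [14, 15, 16, 17]],
]
encoded = append_positions(feature_map)
print(encoded[0][0])  # [1, 2, 3, 4, 1, 0, 1, 0, 0]
print(encoded[1][2])  # [14, 15, 16, 17, 0, 1, 0, 0, 1]
```

Every feature vector grows from N=4 to N+H+W=9 entries.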

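Going further: some personal thoughts

If spatial positions can be encoded this way, then for time-dependent data we could encode the time order by some rule, feed that encoding into the model along with the data, and see whether accuracy improves.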
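
One possible reading of this idea is sketched below: each time step gets a one-hot code of its position within a cycle, appended to its features. The `period` parameter and the helper names are hypothetical, and whether this helps accuracy is an open question that the sketch does not answer.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def append_time_encoding(series, period):
    """Append a one-hot code of (step index mod period) to each feature vector."""
    return [
        features + one_hot(step % period, period)
        for step, features in enumerate(series)
    ]


series = [[0.5], [0.7], [0.2], [0.9]]  # one feature per time step
encoded_series = append_time_encoding(series, period=3)
print(encoded_series[3])  # [0.9, 1.0, 0.0, 0.0]
```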

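I also reimplemented the position-encoding code from the original project and built a small example to help understand it. The code is as follows: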

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''  # hide the GPUs so the demo runs on the CPU
# The demo uses TF1-style sessions. Importing tensorflow.compat.v1 is assumed
# here so that it also runs on TF 2.x; the unused imports have been removed.
import tensorflow.compat.v1 as tf
import tf_slim as slim

tf.disable_eager_execution()

if __name__ == '__main__':
    data = [
        [
            [
                [1, 2, 3, 4],
                [4, 5, 6, 7],
                [8, 9, 10, 11]
            ],
            [
                [7, 8, 9, 10],
                [10, 11, 12, 13],
                [14, 15, 16, 17]
            ]
        ],
        [
            [
                [13, 14, 15, 16],
                [16, 17, 17, 18],
                [19, 20, 21, 22]
            ],
            [
                [19, 20, 21, 19],
                [22, 23, 24, 20],
                [21, 22, 23, 24]
            ]
        ]
    ]
    net = tf.constant(data, tf.float32)
    print("net.shape = ", net.shape)  # batch=2, h=2, w=3, features=4
    batch_size, h, w, _ = net.shape.as_list()
    # x holds the column index and y the row index of every position; both are [h, w].
    x, y = tf.meshgrid(tf.range(w), tf.range(h))
    with tf.Session() as sess:
        print("x (column indices):\n{}".format(sess.run(x)))
        print("y (row indices):\n{}".format(sess.run(y)))

    # One-hot encode the indices: w_loc is [h, w, w], h_loc is [h, w, h].
    w_loc = slim.one_hot_encoding(x, num_classes=w)
    h_loc = slim.one_hot_encoding(y, num_classes=h)
    with tf.Session() as sess:
        print("w_loc:\n{}".format(sess.run(w_loc)))
        print("h_loc:\n{}".format(sess.run(h_loc)))
    # Height code first, then width code: [h, w, h + w].
    loc = tf.concat([h_loc, w_loc], 2)
    # Repeat the same position code for every item in the batch.
    loc = tf.tile(tf.expand_dims(loc, 0), [batch_size, 1, 1, 1])
    # Append the position code to the features: last dimension becomes 4 + h + w.
    res = tf.concat([net, loc], 3)
    print("res.shape = ", res.shape)

    with tf.Session() as sess:
        print("loc:\n{}".format(sess.run(loc)))
        print("res:\n{}".format(sess.run(res)))

    print("Done")

Step4: RNN decoder with attention. Input: [batch_size, H, W, N+H+W]. Output: [batch_size, seq_length, num_char_classes].

From this point on the code mostly calls library functions, as shown below:

(1) sequence_layers.py

lstm_cell = tf.contrib.rnn.LSTMCell(
    self._mparams.num_lstm_units,  # 256
    # Defaults to False. True enables peephole connections, i.e. the gates
    # also receive the cell state as input, on top of the basic LSTM inputs.
    use_peepholes=False,
    # 10. If set, the cell state is clipped to this value before the output.
    cell_clip=self._mparams.lstm_state_clip_value,
    # If True, accepted and returned states are (c, h) 2-tuples, where c is
    # the current cell state and h is the output at the current time step.
    state_is_tuple=True,
    # (Optional) initializer for the weight and projection matrices.
    initializer=orthogonal_initializer)
lstm_outputs, _ = self.unroll_cell(
    decoder_inputs=decoder_inputs,
    initial_state=lstm_cell.zero_state(self._batch_size, tf.float32),
    loop_function=self.get_input,
    cell=lstm_cell)

(2) sequence_layers.py

  def unroll_cell(self, decoder_inputs, initial_state, loop_function, cell):
    return tf.contrib.legacy_seq2seq.attention_decoder(
        decoder_inputs=decoder_inputs,
        initial_state=initial_state,
        attention_states=self._net,
        cell=cell,
        loop_function=self.get_input)
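
To give a feel for what happens inside the attention decoder at one decoding step, here is a simplified sketch of additive attention. It assumes that each row of `attention_states` is the feature vector of one spatial position and, for brevity, leaves out the learned projection matrices (they are treated as identity). It is an illustration, not the library's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(attention_states, decoder_state, v):
    """One additive-attention step with the projection matrices omitted.

    score_i = sum_k v[k] * tanh(h_i[k] + s[k])
    """
    scores = [
        sum(v_k * math.tanh(h_k + s_k)
            for v_k, h_k, s_k in zip(v, h, decoder_state))
        for h in attention_states
    ]
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of the states.
    context = [
        sum(w * h[k] for w, h in zip(weights, attention_states))
        for k in range(len(decoder_state))
    ]
    return weights, context


attention_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 positions, 2 features
decoder_state = [0.5, -0.5]
v = [1.0, 1.0]
weights, context = attention_step(attention_states, decoder_state, v)
print(weights)
print(context)
```

The weights sum to 1, and the position whose features agree best with the decoder state receives the largest weight.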

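For details on how Attention and LSTM are computed, see:
Self-Attention
RNN
Some notes on LSTM

Note:

Orthogonal initialization matters for RNNs: an orthogonal matrix preserves vector norms, so repeated multiplication by the recurrent weights neither shrinks nor amplifies the signal, which helps against vanishing and exploding gradients.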
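
A small numeric illustration of the usual argument: multiplying a vector repeatedly by an orthogonal matrix keeps its norm, while a matrix that scales by slightly less or more than 1 makes it vanish or explode. The matrices below are toy 2 x 2 examples chosen for the demonstration; a real LSTM also has gates and nonlinearities, so this only shows the linear part of the story.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def norm_after_steps(matrix, vector, steps):
    """Norm of the vector after `steps` repeated multiplications."""
    for _ in range(steps):
        vector = matvec(matrix, vector)
    return norm(vector)


theta = 0.3
orthogonal = [[math.cos(theta), -math.sin(theta)],
              [math.sin(theta), math.cos(theta)]]  # a rotation matrix
shrinking = [[0.9, 0.0], [0.0, 0.9]]
growing = [[1.1, 0.0], [0.0, 1.1]]

start = [3.0, 4.0]  # norm 5
print(norm_after_steps(orthogonal, start, 50))  # stays close to 5
print(norm_after_steps(shrinking, start, 50))   # decays towards 0
print(norm_after_steps(growing, start, 50))     # grows by orders of magnitude
```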

Step5: Predict the characters. Input: [batch_size, seq_length, num_char_classes]. Output: predicted_chars, chars_log_prob, predicted_scores.

(1) model.py

# Prediction
predicted_chars, chars_log_prob, predicted_scores = (
    self.char_predictions(chars_logit))
# predicted_chars: the predicted characters, an int32 tensor of shape [batch_size x seq_length];
# chars_log_prob: log probabilities of all characters, a float tensor of shape [batch_size, seq_length, num_char_classes];
# predicted_scores: the confidence scores of the predicted characters, a float tensor of shape [batch_size x seq_length].

(2) model.py

The prediction is made with softmax.

  # Returns the confidence scores (softmax values) of the predicted characters.
  def char_predictions(self, chars_logit):
    """Returns confidence scores (softmax values) for predicted characters.

    Args:
      chars_logit: chars logits, a tensor with shape
        [batch_size x seq_length x num_char_classes]

    Returns:
      A tuple (ids, log_prob, scores), where:
        ids - predicted characters, a int32 tensor with shape
          [batch_size x seq_length];
        log_prob - a log probability of all characters, a float tensor with
          shape [batch_size, seq_length, num_char_classes];
        scores - corresponding confidence scores for characters, a float
          tensor with shape [batch_size x seq_length].
    """
    log_prob = utils.logits_to_log_prob(chars_logit)
    ids = tf.to_int32(tf.argmax(log_prob, axis=2), name='predicted_chars')
    mask = tf.cast(
      slim.one_hot_encoding(ids, self._params.num_char_classes), tf.bool)
    all_scores = tf.nn.softmax(chars_logit)
    selected_scores = tf.boolean_mask(all_scores, mask, name='char_scores')
    scores = tf.reshape(selected_scores, shape=(-1, self._params.seq_length))
    return ids, log_prob, scores
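
The same computation can be traced by hand on a tiny example. The sketch below handles a single sequence in plain Python and takes the score as the softmax probability of the selected character, which is what the boolean mask above picks out. The logits are made up for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def char_predictions(chars_logit):
    """chars_logit: one sequence, a list of per-position logit lists."""
    log_prob = [log_softmax(step) for step in chars_logit]
    # argmax over the character classes at every position
    ids = [max(range(len(step)), key=step.__getitem__) for step in log_prob]
    # confidence score = softmax probability of the chosen character
    scores = [math.exp(step[i]) for step, i in zip(log_prob, ids)]
    return ids, log_prob, scores


chars_logit = [
    [2.0, 0.5, 0.1],  # position 0: class 0 has the largest logit
    [0.0, 0.0, 3.0],  # position 1: class 2 has the largest logit
]
ids, log_prob, scores = char_predictions(chars_logit)
print(ids)  # [0, 2]
print(scores)
```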

Step6: Predict the text

(1) model.py

      if self._charset:
        character_mapper = CharsetMapper(self._charset)
        # Returns the text corresponding to the predicted characters.
        predicted_text = character_mapper.get_text(predicted_chars)
      else:
        predicted_text = tf.constant([])

(2) model.py

  # Returns the text
  def get_text(self, ids):
    """Returns a string corresponding to a sequence of character ids.

    Args:
      ids: a tensor with shape [batch_size, max_sequence_length]
    """
    return tf.reduce_join(
      self.table.lookup(tf.to_int64(ids)), reduction_indices=1)
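
In plain Python the lookup-and-join amounts to the following. The `charset` dictionary is a made-up four-character example; the real charset is loaded by the project, and this sketch does not model how unknown ids are handled.

```python
def get_text(ids_batch, charset):
    """Map each row of character ids to a string via a lookup table."""
    return ["".join(charset[i] for i in ids) for ids in ids_batch]


charset = {0: "你", 1: "好", 2: "世", 3: "界"}  # hypothetical id -> character table
predicted_chars = [[0, 1], [2, 3]]               # [batch_size, seq_length]
texts = get_text(predicted_chars, charset)
print(texts)  # ['你好', '世界']
```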

With that, the execution flow of the code for this paper is fully clear.
