Paper: https://arxiv.org/pdf/2205.00159.pdf
Model code: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/ppocr/modeling/backbones/rec_svtrnet.py
1. How to Use the Code
1.1. Preparing the Dataset
Download the dataset from https://rrc.cvc.uab.es/?ch=4&com=tasks (used here just to debug the code).
The data looks like this:
Put the downloaded data into the corresponding directories. Since this is a text recognition task, only the gt.txt files are needed, not the coords.txt files; then rename the gt.txt files. In my setup, train_list.txt is the gt.txt of the training set, and val_list.txt is the gt.txt of the validation set.
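For reference, each line of train_list.txt / val_list.txt pairs an image filename with its text label, separated by a delimiter (a comma in the ICDAR2015 word-recognition gt.txt; see the delimiter note in Section 1.2). A line looks roughly like the following (an illustrative example from memory; check your own files):

word_1.png, "Genaxis Theatre"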
1.2. Configuring the Code
After pulling the PaddleOCR project code, open the configs/rec/rec_svtrnet.yml file and modify the corresponding parameters; every line with a comment needs to be adjusted to your actual setup.
Global:
  use_gpu: False # whether to use the GPU
  epoch_num: 20
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/rec/svtr/
  save_epoch_step: 1
  # evaluation is run every 2000 iterations after the 0th iteration
  eval_batch_step: [10, 2000] # [0, 2000]
  cal_metric_during_train: True
  pretrained_model:
  checkpoints:
  save_inference_dir:
  use_visualdl: False
  infer_img: doc/imgs_words_en/word_10.png
  # for data or label process
  character_dict_path: ../ppocr/utils/ic15_dict.txt # character dictionary file
  character_type: en
  max_text_length: 25
  infer_mode: False
  use_space_char: False
  save_res_path: ./output/rec/predicts_svtr_tiny.txt

Optimizer:
  name: AdamW
  beta1: 0.9
  beta2: 0.99
  epsilon: 8.e-8
  weight_decay: 0.05
  no_weight_decay_name: norm pos_embed
  one_dim_param_no_weight_decay: true
  lr:
    name: Cosine
    learning_rate: 0.0005
    warmup_epoch: 2

Architecture:
  model_type: rec
  algorithm: SVTR
  Transform:
    name: STN_ON
    tps_inputsize: [32, 64] # [32, 64]
    tps_outputsize: [32, 100] # [32, 100]
    num_control_points: 20
    tps_margins: [0.05,0.05]
    stn_activation: none
  Backbone:
    name: SVTRNet
    img_size: [32, 100] # [32, 100] input image size
    out_char_num: 25 # number of output characters
    out_channels: 192 # output dimension
    patch_merging: 'Conv'
    embed_dim: [8, 8, 8] # [64, 128, 256] feature dimension of each block
    depth: [3, 3, 3] # [3, 6, 3] depth of each block
    num_heads: [2, 2, 2] # [2, 4, 8] number of attention heads per block
    mixer: ['Local','Local','Local','Local','Local','Local','Global','Global','Global'] # ['Local','Local','Local','Local','Local','Local','Global','Global','Global','Global','Global','Global'] mixer type of each layer, matching the depth parameter
    local_mixer: [[7, 11], [7, 11], [7, 11]] # window size for Local Mixing
    last_stage: True
    prenorm: false
  Neck:
    name: SequenceEncoder
    encoder_type: reshape
  Head:
    name: CTCHead

Loss:
  name: CTCLoss

PostProcess:
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc

Train:
  dataset:
    name: SimpleDataSet
    data_dir: ../train_data/train # path to the training data
    label_file_list:
      - ../train_data/train_list.txt # label file of the training data
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - SVTRRecResizeImg:
          image_shape: [3, 64, 256]
          padding: False
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: True
    batch_size_per_card: 2 # 512; adjust the batch size
    drop_last: True
    num_workers: 0 # 4

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: ../train_data/val # path to the validation data
    label_file_list:
      - ../train_data/val_list.txt # label file of the validation data
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - SVTRRecResizeImg:
          image_shape: [3, 64, 256]
          padding: False
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 256 # adjust the batch size
    num_workers: 0
You also need to modify ppocr/data/simple_dataset.py, as shown below:
class SimpleDataSet(Dataset):
    def __init__(self, config, mode, logger, seed=None):
        super(SimpleDataSet, self).__init__()
        self.logger = logger
        self.mode = mode.lower()
        global_config = config['Global']
        dataset_config = config[mode]['dataset']
        loader_config = config[mode]['loader']
        # change this to your own delimiter, i.e. the per-line separator
        # used in train_list.txt and val_list.txt
        self.delimiter = dataset_config.get('delimiter', ',')
        label_file_list = dataset_config.pop('label_file_list')
        data_source_num = len(label_file_list)
        ratio_list = dataset_config.get("ratio_list", 1.0)
        if isinstance(ratio_list, (float, int)):
            ratio_list = [float(ratio_list)] * int(data_source_num)
        ......
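A quick way to confirm the delimiter actually matches your label files is to split a few lines manually; a minimal sketch (the path is taken from the config above and may differ on your machine):

# Sanity check: does the configured delimiter split your label file correctly?
with open('../train_data/train_list.txt', encoding='utf-8') as f:
    for line in list(f)[:3]:
        img_name, label = line.strip().split(',', 1)  # delimiter=',' as set above
        print(img_name, '->', label)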
1.3. Running the Code
Open tools/train.py, set the run argument -c ../configs/rec/rec_svtrnet.yml, and run it.
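Equivalently, from the command line (adjust the config path to your working directory):

python tools/train.py -c configs/rec/rec_svtrnet.yml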
2. Overall Flowchart
3. Backbone
3.1. PatchEmbedding
The Patch Embedding operation splits the input image into N patches; in the actual code this is still implemented with convolutions. Here the downsampling factor p is 4 (32/8).
Input: the preprocessed text image. In my run, the input has shape (2, 3, 32, 100), where 2 is the batch_size, 3 the number of image channels, 32 the image height, and 100 the image width.
Processing: nn.Conv2D, nn.BatchNorm2D, nn.GELU
Output: (2, 200, 8)
Core code: x = self.proj(x).flatten(2).transpose((0, 2, 1))
Here, the core of self.proj(x) is:
# two ConvBNLayer stages, executed in sequence
def forward(self, inputs):     # (2, 3, 32, 100)
    out = self.conv(inputs)    # (2, 4, 16, 50)
    out = self.norm(out)       # (2, 4, 16, 50)
    out = self.act(out)        # (2, 4, 16, 50)
    return out

def forward(self, inputs):     # (2, 4, 16, 50)
    out = self.conv(inputs)    # (2, 8, 8, 25)
    out = self.norm(out)       # (2, 8, 8, 25)
    out = self.act(out)        # (2, 8, 8, 25)
    return out
After flatten, the shape is (2, 8, 200); after transpose it is (2, 200, 8): there are 200 patches, each of dimension 8.
The block then applies x = x + self.pos_embed, where self.pos_embed is the positional encoding, a learnable parameter of shape (1, 200, 8).
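Putting the pieces together, here is a minimal runnable sketch of this patch embedding (the class names ConvBNLayer/PatchEmbedSketch are illustrative, not PaddleOCR's exact implementation):

import paddle
import paddle.nn as nn

class ConvBNLayer(nn.Layer):
    def __init__(self, in_c, out_c):
        super().__init__()
        self.conv = nn.Conv2D(in_c, out_c, kernel_size=3, stride=2, padding=1)
        self.norm = nn.BatchNorm2D(out_c)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

class PatchEmbedSketch(nn.Layer):
    def __init__(self, img_size=(32, 100), in_c=3, embed_dim=8):
        super().__init__()
        # two stride-2 convolutions -> overall downsampling factor p = 4
        self.proj = nn.Sequential(
            ConvBNLayer(in_c, embed_dim // 2),
            ConvBNLayer(embed_dim // 2, embed_dim))
        num_patches = (img_size[0] // 4) * (img_size[1] // 4)  # 8 * 25 = 200
        self.pos_embed = self.create_parameter(
            shape=[1, num_patches, embed_dim],
            default_initializer=nn.initializer.TruncatedNormal(std=0.02))

    def forward(self, x):                      # (2, 3, 32, 100)
        x = self.proj(x)                       # (2, 8, 8, 25)
        x = x.flatten(2).transpose((0, 2, 1))  # (2, 200, 8)
        return x + self.pos_embed              # learnable positional encoding

print(PatchEmbedSketch()(paddle.randn([2, 3, 32, 100])).shape)  # [2, 200, 8]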
3.2. The First MixingBlock
3.2.1. The First LN Layer in the MixingBlock
x = x + self.drop_path(self.mixer(self.norm1(x)))
3.2.2. LocalMixing (Core)
Attention is simply computed within each (7, 11) window. The core code is as follows:
def forward(self, x):
    if self.HW is not None:
        N = self.N
        C = self.C
    else:
        _, N, C = x.shape
    qkv = self.qkv(x).reshape(
        (0, N, 3, self.num_heads, C // self.num_heads)).transpose(
            (2, 0, 3, 1, 4))
    q, k, v = qkv[0] * self.scale, qkv[1], qkv[2]
    attn = (q.matmul(k.transpose((0, 1, 3, 2))))
    # the mask is only needed for Local mixing
    if self.mixer == 'Local':
        attn += self.mask
    attn = nn.functional.softmax(attn, axis=-1)
    attn = self.attn_drop(attn)
    x = (attn.matmul(v)).transpose((0, 2, 1, 3)).reshape((0, N, C))
    x = self.proj(x)
    x = self.proj_drop(x)
    return x
3.2.2.1. Attention Mask
This is the self.mask in the code above.
In local mixing, each position only computes attention with the other positions inside the (7, 11) window centered on it. We therefore need to mark which positions take part in the attention computation and which do not, and that is exactly what the attention mask is for. The core code is as follows:
class Attention(nn.Layer):
    def __init__(self,
                 dim,
                 num_heads=8,
                 mixer='Global',
                 HW=None,
                 local_k=[7, 11],
                 qkv_bias=False,
                 qk_scale=None,
                 attn_drop=0.,
                 proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim**-0.5
        self.qkv = nn.Linear(dim, dim * 3, bias_attr=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)
        self.HW = HW
        if HW is not None:
            H = HW[0]
            W = HW[1]
            self.N = H * W
            self.C = dim
        if mixer == 'Local' and HW is not None:
            hk = local_k[0]
            wk = local_k[1]
            mask = paddle.ones(
                [H * W, H + hk - 1, W + wk - 1], dtype='float32')
            for h in range(0, H):
                for w in range(0, W):
                    mask[h * W + w, h:h + hk, w:w + wk] = 0.
            mask_paddle = mask[:, hk // 2:H + hk // 2,
                               wk // 2:W + wk // 2].flatten(1)
            mask_inf = paddle.full([H * W, H * W], '-inf', dtype='float32')
            mask = paddle.where(mask_paddle < 1, mask_paddle, mask_inf)
            self.mask = mask.unsqueeze([0, 1])  # (1, 1, H*W, H*W)
        self.mixer = mixer
To make this easier to visualize, take H=4, W=4, hk=3, wk=5 as an example. H and W are the height and width of the feature map; hk and wk are the height and width of the window.
mask = paddle.ones([H * W, H + hk - 1, W + wk - 1], dtype='float32')
for h in range(0, H):
    for w in range(0, W):
        mask[h * W + w, h:h + hk, w:w + wk] = 0.
The visualization looks like this:
mask_inf = paddle.full([H * W, H * W], '-inf', dtype='float32')
mask = paddle.where(mask_paddle < 1, mask_paddle, mask_inf)
Then every 1 in the figure above is replaced with negative infinity; the visualization looks like this:
Here -inf means the current position does not take part in the attention computation.
Still using H=4, W=4, hk=3, wk=5 as the example, the following visualizes how each feature point computes attention according to the mask above.
The figure shows, for each point on the feature map, which feature points it computes attention with (the red-bordered regions). Take the first feature point as an example: it only computes attention with feature points 1, 2, 3, 5, 6 and 7, which corresponds to the first row of the mask image; the sketch below verifies this.
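For concreteness, this minimal sketch builds the mask exactly as above with H=4, W=4, hk=3, wk=5 and prints which positions the first feature point attends to:

import paddle

H, W, hk, wk = 4, 4, 3, 5
mask = paddle.ones([H * W, H + hk - 1, W + wk - 1], dtype='float32')
for h in range(H):
    for w in range(W):
        mask[h * W + w, h:h + hk, w:w + wk] = 0.
mask_paddle = mask[:, hk // 2:H + hk // 2, wk // 2:W + wk // 2].flatten(1)
mask_inf = paddle.full([H * W, H * W], float('-inf'), dtype='float32')
mask = paddle.where(mask_paddle < 1, mask_paddle, mask_inf)

# 1-based positions the first feature point attends to: [1, 2, 3, 5, 6, 7]
print([i + 1 for i in range(H * W) if float(mask[0, i]) == 0.])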
3.2.2.2. Computing q, k and v
def forward(self, x):
    if self.HW is not None:
        N = self.N
        C = self.C
    else:
        _, N, C = x.shape
    qkv = self.qkv(x).reshape(
        (0, N, 3, self.num_heads, C // self.num_heads)).transpose(
            (2, 0, 3, 1, 4))
    q, k, v = qkv[0] * self.scale, qkv[1], qkv[2]
    attn = (q.matmul(k.transpose((0, 1, 3, 2))))
    # the mask is only needed for Local mixing
    if self.mixer == 'Local':
        attn += self.mask
    attn = nn.functional.softmax(attn, axis=-1)
    attn = self.attn_drop(attn)
    x = (attn.matmul(v)).transpose((0, 2, 1, 3)).reshape((0, N, C))
    x = self.proj(x)
    x = self.proj_drop(x)
    return x
Here, qkv = self.qkv(x).reshape((0, N, 3, self.num_heads, C // self.num_heads)).transpose((2, 0, 3, 1, 4)) computes q, k and v. The combined projection weight for w_q, w_k and w_v has shape (8, 24); the qkv output has shape (2, 200, 24), which becomes (2, 200, 3, 2, 4) after reshape and (3, 2, 2, 200, 4) after transpose.
In q, k, v = qkv[0] * self.scale, qkv[1], qkv[2], each of q, k and v has shape (2, 2, 200, 4). The first 2 is the batch size and the second 2 is the number of attention heads.
3.2.2.3. Computing the Attention Values
The attention values follow the standard scaled dot-product formula, Attention(Q, K, V) = softmax(QK^T / sqrt(d) + mask) V, which the code below implements:
attn = (q.matmul(k.transpose((0, 1, 3, 2))))  # (2, 2, 200, 200)
if self.mixer == 'Local':
    # add the predefined mask to the computed attn; the mask's shape is (1, 1, 200, 200)
    attn += self.mask
attn = nn.functional.softmax(attn, axis=-1)  # softmax; -inf becomes 0 after softmax (2, 2, 200, 200)
attn = self.attn_drop(attn)  # dropout; the rate here is 0, i.e. no dropout (2, 2, 200, 200)
x = (attn.matmul(v)).transpose((0, 2, 1, 3)).reshape((0, N, C))  # final attention output (2, 200, 8)
3.2.3. LocalMixing + Identity Mapping
x = x + self.drop_path(self.mixer(self.norm1(x)))
3.2.4. The Rest of the MixingBlock
x = x + self.drop_path(self.mlp(self.norm2(x))) # (2, 200, 8)
Here the mlp network consists of the following:
def forward(self, x):
    x = self.fc1(x)   # first fully connected layer
    x = self.act(x)   # activation layer
    x = self.drop(x)  # dropout layer
    x = self.fc2(x)   # second fully connected layer
    x = self.drop(x)  # dropout layer
    return x
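Putting Sections 3.2.1-3.2.4 together, a MixingBlock is a standard pre-norm Transformer block. A minimal sketch (drop_path omitted, the mlp inlined as a Sequential; it assumes the Attention class shown above is in scope, and the name MixingBlockSketch is illustrative):

import paddle
import paddle.nn as nn

class MixingBlockSketch(nn.Layer):
    def __init__(self, dim=8, num_heads=2, mlp_ratio=4.0,
                 mixer='Local', HW=(8, 25)):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = Attention(dim, num_heads=num_heads,
                                     mixer=mixer, HW=HW)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                        # (2, 200, 8)
        x = x + self.token_mixer(self.norm1(x))  # token mixing + residual
        x = x + self.mlp(self.norm2(x))          # channel MLP + residual
        return x

print(MixingBlockSketch()(paddle.randn([2, 200, 8])).shape)  # [2, 200, 8]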
3.3. The First PatchMerging
if self.patch_merging is not None:
    x = self.sub_sample1(
        x.transpose([0, 2, 1]).reshape(
            [0, self.embed_dim[0], self.HW[0], self.HW[1]]))
Input: (2, 8, 8, 25)
Processing: nn.Conv2D, nn.LayerNorm
Output: (2, 100, 8)
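A minimal sketch of this 'Conv' patch merging (the class name SubSampleSketch is illustrative): a stride-(2, 1) convolution halves the height while keeping the width, then the result is flattened back into a sequence and layer-normalized.

import paddle
import paddle.nn as nn

class SubSampleSketch(nn.Layer):
    def __init__(self, in_c=8, out_c=8):
        super().__init__()
        self.conv = nn.Conv2D(in_c, out_c, kernel_size=3,
                              stride=(2, 1), padding=1)
        self.norm = nn.LayerNorm(out_c)

    def forward(self, x):                      # (2, 8, 8, 25)
        x = self.conv(x)                       # (2, 8, 4, 25)
        x = x.flatten(2).transpose((0, 2, 1))  # (2, 100, 8)
        return self.norm(x)

print(SubSampleSketch()(paddle.randn([2, 8, 8, 25])).shape)  # [2, 100, 8]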
3.4. The Second MixingBlock
The code is exactly the same as the first MixingBlock, so it is not repeated here.
3.5. The Second PatchMerging
Input: (2, 8, 4, 25)
Processing: nn.Conv2D, nn.LayerNorm
Output: (2, 50, 8)
3.6. The Third MixingBlock
def forward(self, x):
    if self.HW is not None:
        N = self.N
        C = self.C
    else:
        _, N, C = x.shape
    qkv = self.qkv(x).reshape(
        (0, N, 3, self.num_heads, C // self.num_heads)).transpose(
            (2, 0, 3, 1, 4))
    q, k, v = qkv[0] * self.scale, qkv[1], qkv[2]
    attn = (q.matmul(k.transpose((0, 1, 3, 2))))
    # the mask is only needed for Local mixing
    if self.mixer == 'Local':
        attn += self.mask
    attn = nn.functional.softmax(attn, axis=-1)
    attn = self.attn_drop(attn)
    x = (attn.matmul(v)).transpose((0, 2, 1, 3)).reshape((0, N, C))
    x = self.proj(x)
    x = self.proj_drop(x)
    return x
The current MixingBlock uses Global Mixing. For Global Mixing, every point computes attention with all points, so there is no mask. The rest of the code is identical, so it is not repeated here.
3.7. PatchCombining
def forward(self, x):  # x: (2, 3, 32, 100)
    # forward_features returns (2, 50, 8)
    x = self.forward_features(x)
    if self.use_lenhead:
        len_x = self.len_conv(x.mean(1))
        len_x = self.dropout_len(self.hardswish_len(len_x))
    # the combining stage starts here
    if self.last_stage:
        if self.patch_merging is not None:
            h = self.HW[0] // 4
        else:
            h = self.HW[0]
        # first an average-pooling layer (2, 8, 1, 25)
        x = self.avg_pool(
            x.transpose([0, 2, 1]).reshape(
                [0, self.embed_dim[2], h, self.HW[1]]))
        # then a convolution layer (2, 192, 1, 25)
        x = self.last_conv(x)
        # then the Hardswish activation (2, 192, 1, 25)
        x = self.hardswish(x)
        # finally a dropout layer (2, 192, 1, 25)
        x = self.dropout(x)
    if self.use_lenhead:
        return x, len_x
    return x
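As a standalone illustration of this combining stage, here is a hedged sketch that reproduces the shape transitions above (the layer hyperparameters are assumptions based on the tiny config, not copied verbatim from the repo):

import paddle
import paddle.nn as nn

embed_dim, out_channels, HW = 8, 192, (8, 25)
avg_pool = nn.AdaptiveAvgPool2D([1, HW[1]])   # pool the height down to 1
last_conv = nn.Conv2D(embed_dim, out_channels, kernel_size=1, bias_attr=False)
hardswish = nn.Hardswish()
dropout = nn.Dropout(p=0.1)

x = paddle.randn([2, 50, embed_dim])          # sequence output of stage 3
h = HW[0] // 4                                # 2, after two patch mergings
x = x.transpose([0, 2, 1]).reshape([0, embed_dim, h, HW[1]])  # (2, 8, 2, 25)
x = avg_pool(x)                               # (2, 8, 1, 25)
x = dropout(hardswish(last_conv(x)))          # (2, 192, 1, 25)
print(x.shape)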
4. Neck
Input: (2, 192, 1, 25)
Processing: squeeze, transpose
Output: (2, 25, 192)
def forward(self, x):  # (2, 192, 1, 25)
    if self.encoder_type != 'svtr':
        # turn the feature map into a sequence
        x = self.encoder_reshape(x)
        if not self.only_reshape:
            x = self.encoder(x)
        return x
    else:
        x = self.encoder(x)
        x = self.encoder_reshape(x)
        return x
The core code of x = self.encoder_reshape(x) is:
def forward(self, x):  # (2, 192, 1, 25)
    B, C, H, W = x.shape
    assert H == 1
    # (2, 192, 25)
    x = x.squeeze(axis=2)
    # (2, 25, 192): 25 sequence steps, each of dimension 192
    x = x.transpose([0, 2, 1])  # (NTC)(batch, width, channels)
    return x
5. Head
Input: (2, 25, 192)
Processing: nn.Linear
Output: (2, 25, 37)
def forward(self, x, targets=None):  # (2, 25, 192)
    if self.mid_channels is None:
        # a single fully connected layer (2, 25, 37)
        predicts = self.fc(x)
    else:
        x = self.fc1(x)
        predicts = self.fc2(x)
    if self.return_feats:
        result = (x, predicts)
    else:
        result = predicts
    if not self.training:
        predicts = F.softmax(predicts, axis=2)
        result = predicts
    return result
Note that in the output shape (2, 25, 37), 25 is the number of output positions, i.e. at most 25 characters can be produced; 37 is the size of the character set, namely the number of characters in the configured dictionary character_dict_path: ../ppocr/utils/ic15_dict.txt plus one extra class for the CTC blank token (use_space_char is False here, so the extra class is the blank, not a space).
Regarding the limit on the number of characters: if an image is very long and contains too many characters, some may go unrecognized, as shown below:
Since each position can predict only one character, the color-marked positions above may be missed.
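To see how the 25 positions turn into a text string, here is a hedged sketch of greedy CTC decoding over the (2, 25, 37) output (assuming the blank is at index 0, as in PaddleOCR's CTCLabelDecode, and that ic15_dict.txt holds the 36 characters 0-9 and a-z):

import paddle

def ctc_greedy_decode(probs, charset):
    ids = probs.argmax(axis=-1).numpy()       # (batch, 25)
    texts = []
    for seq in ids:
        chars, prev = [], -1
        for i in seq:
            if i != prev and i != 0:          # collapse repeats, drop blanks
                chars.append(charset[i - 1])
            prev = i
        texts.append(''.join(chars))
    return texts

charset = list('0123456789abcdefghijklmnopqrstuvwxyz')  # 36 chars, as in ic15_dict.txt
probs = paddle.nn.functional.softmax(paddle.randn([2, 25, 37]), axis=-1)
print(ctc_greedy_decode(probs, charset))      # two random (meaningless) strings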
This article is just my personal study notes; if there are any mistakes, corrections from readers are very welcome.