Paper: https://arxiv.org/pdf/2205.00159.pdf
Model code: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/ppocr/modeling/backbones/rec_svtrnet.py
1. How to Use the Code
1.1. Preparing the Dataset
Download the dataset from https://rrc.cvc.uab.es/?ch=4&com=tasks (used here just to debug the code).
The data looks like this:
Put the downloaded data into the corresponding directories. Since this is a text recognition task, only the gt.txt files are needed, not the coords.txt files; then rename the gt.txt files. In my setup, train_list.txt is the gt.txt of the training set, and val_list.txt is the gt.txt of the validation set.
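For reference, each line of train_list.txt / val_list.txt pairs an image filename with its text label, separated by a delimiter (a comma in the ICDAR2015 word-recognition gt.txt; see the delimiter note in Section 1.2). A line looks roughly like the following (an illustrative example from memory; check your own files):

word_1.png, "Genaxis Theatre"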
1.2. Configuring the Code
After pulling the PaddleOCR project code, open the configs/rec/rec_svtrnet.yml file and modify the corresponding parameters; every line with a comment needs to be adjusted to your actual setup.
Global:
  use_gpu: False # whether to use the GPU
  epoch_num: 20
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/rec/svtr/
  save_epoch_step: 1
  # evaluation is run every 2000 iterations after the 0th iteration
  eval_batch_step: [10, 2000] # [0, 2000]
  cal_metric_during_train: True
  pretrained_model:
  checkpoints:
  save_inference_dir:
  use_visualdl: False
  infer_img: doc/imgs_words_en/word_10.png
  # for data or label process
  character_dict_path: ../ppocr/utils/ic15_dict.txt # character dictionary file
  character_type: en
  max_text_length: 25
  infer_mode: False
  use_space_char: False
  save_res_path: ./output/rec/predicts_svtr_tiny.txt

Optimizer:
  name: AdamW
  beta1: 0.9
  beta2: 0.99
  epsilon: 8.e-8
  weight_decay: 0.05
  no_weight_decay_name: norm pos_embed
  one_dim_param_no_weight_decay: true
  lr:
    name: Cosine
    learning_rate: 0.0005
    warmup_epoch: 2

Architecture:
  model_type: rec
  algorithm: SVTR
  Transform:
    name: STN_ON
    tps_inputsize: [32, 64] # [32, 64]
    tps_outputsize: [32, 100] # [32, 100]
    num_control_points: 20
    tps_margins: [0.05,0.05]
    stn_activation: none
  Backbone:
    name: SVTRNet
    img_size: [32, 100] # [32, 100] input image size
    out_char_num: 25 # number of output characters
    out_channels: 192 # output dimension
    patch_merging: 'Conv'
    embed_dim: [8, 8, 8] # [64, 128, 256] feature dimension of each block
    depth: [3, 3, 3] # [3, 6, 3] depth of each block
    num_heads: [2, 2, 2] # [2, 4, 8] number of attention heads per block
    mixer: ['Local','Local','Local','Local','Local','Local','Global','Global','Global'] # ['Local','Local','Local','Local','Local','Local','Global','Global','Global','Global','Global','Global'] mixer type of each layer, matching the depth parameter
    local_mixer: [[7, 11], [7, 11], [7, 11]] # window size for Local Mixing
    last_stage: True
    prenorm: false
  Neck:
    name: SequenceEncoder
    encoder_type: reshape
  Head:
    name: CTCHead

Loss:
  name: CTCLoss

PostProcess:
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc

Train:
  dataset:
    name: SimpleDataSet
    data_dir: ../train_data/train # path to the training data
    label_file_list:
      - ../train_data/train_list.txt # label file of the training data
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - SVTRRecResizeImg:
          image_shape: [3, 64, 256]
          padding: False
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: True
    batch_size_per_card: 2 # 512; adjust the batch size
    drop_last: True
    num_workers: 0 # 4

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: ../train_data/val # path to the validation data
    label_file_list:
      - ../train_data/val_list.txt # label file of the validation data
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - SVTRRecResizeImg:
          image_shape: [3, 64, 256]
          padding: False
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 256 # adjust the batch size
    num_workers: 0
You also need to modify ppocr/data/simple_dataset.py, as shown below:
class SimpleDataSet(Dataset):
    def __init__(self, config, mode, logger, seed=None):
        super(SimpleDataSet, self).__init__()
        self.logger = logger
        self.mode = mode.lower()
        global_config = config['Global']
        dataset_config = config[mode]['dataset']
        loader_config = config[mode]['loader']
        # change this to your own delimiter, i.e. the per-line separator
        # used in train_list.txt and val_list.txt
        self.delimiter = dataset_config.get('delimiter', ',')
        label_file_list = dataset_config.pop('label_file_list')
        data_source_num = len(label_file_list)
        ratio_list = dataset_config.get("ratio_list", 1.0)
        if isinstance(ratio_list, (float, int)):
            ratio_list = [float(ratio_list)] * int(data_source_num)
        ......
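A quick way to confirm the delimiter actually matches your label files is to split a few lines manually; a minimal sketch (the path is taken from the config above and may differ on your machine):

# Sanity check: does the configured delimiter split your label file correctly?
with open('../train_data/train_list.txt', encoding='utf-8') as f:
    for line in list(f)[:3]:
        img_name, label = line.strip().split(',', 1)  # delimiter=',' as set above
        print(img_name, '->', label)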
1.3. Running the Code
Open tools/train.py, set the run argument -c ../configs/rec/rec_svtrnet.yml, and run it.
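Equivalently, from the command line (adjust the config path to your working directory):

python tools/train.py -c configs/rec/rec_svtrnet.yml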
2. Overall Flowchart
3. Backbone
3.1. PatchEmbedding
The Patch Embedding operation splits the input image into N patches; in the actual code this is still implemented with convolutions. Here the downsampling factor p is 4 (32/8).
Input: the preprocessed text image. In my run, the input has shape (2, 3, 32, 100), where 2 is the batch_size, 3 the number of image channels, 32 the image height, and 100 the image width.
Processing: nn.Conv2D, nn.BatchNorm2D, nn.GELU
Output: (2, 200, 8)
Core code: x = self.proj(x).flatten(2).transpose((0, 2, 1))
Here, the core of self.proj(x) is:
# two ConvBNLayer stages, executed in sequence
def forward(self, inputs):     # (2, 3, 32, 100)
    out = self.conv(inputs)    # (2, 4, 16, 50)
    out = self.norm(out)       # (2, 4, 16, 50)
    out = self.act(out)        # (2, 4, 16, 50)
    return out

def forward(self, inputs):     # (2, 4, 16, 50)
    out = self.conv(inputs)    # (2, 8, 8, 25)
    out = self.norm(out)       # (2, 8, 8, 25)
    out = self.act(out)        # (2, 8, 8, 25)
    return out
After flatten, the shape is (2, 8, 200); after transpose it is (2, 200, 8): there are 200 patches, each of dimension 8.
The block then applies x = x + self.pos_embed, where self.pos_embed is the positional encoding, a learnable parameter of shape (1, 200, 8).
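Putting the pieces together, here is a minimal runnable sketch of this patch embedding (the class names ConvBNLayer/PatchEmbedSketch are illustrative, not PaddleOCR's exact implementation):

import paddle
import paddle.nn as nn

class ConvBNLayer(nn.Layer):
    def __init__(self, in_c, out_c):
        super().__init__()
        self.conv = nn.Conv2D(in_c, out_c, kernel_size=3, stride=2, padding=1)
        self.norm = nn.BatchNorm2D(out_c)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

class PatchEmbedSketch(nn.Layer):
    def __init__(self, img_size=(32, 100), in_c=3, embed_dim=8):
        super().__init__()
        # two stride-2 convolutions -> overall downsampling factor p = 4
        self.proj = nn.Sequential(
            ConvBNLayer(in_c, embed_dim // 2),
            ConvBNLayer(embed_dim // 2, embed_dim))
        num_patches = (img_size[0] // 4) * (img_size[1] // 4)  # 8 * 25 = 200
        self.pos_embed = self.create_parameter(
            shape=[1, num_patches, embed_dim],
            default_initializer=nn.initializer.TruncatedNormal(std=0.02))

    def forward(self, x):                      # (2, 3, 32, 100)
        x = self.proj(x)                       # (2, 8, 8, 25)
        x = x.flatten(2).transpose((0, 2, 1))  # (2, 200, 8)
        return x + self.pos_embed              # learnable positional encoding

print(PatchEmbedSketch()(paddle.randn([2, 3, 32, 100])).shape)  # [2, 200, 8]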
3.2. The First MixingBlock
3.2.1. The First LN Layer in the MixingBlock
x = x + self.drop_path(self.mixer(self.norm1(x)))
3.2.2. LocalMixing (Core)
Attention is simply computed within each (7, 11) window. The core code is as follows:
def forward(self, x):
    if self.HW is not None:
        N = self.N
        C = self.C
    else:
        _, N, C = x.shape
    qkv = self.qkv(x).reshape(
        (0, N, 3, self.num_heads, C // self.num_heads)).transpose(
            (2, 0, 3, 1, 4))
    q, k, v = qkv[0] * self.scale, qkv[1], qkv[2]
    attn = (q.matmul(k.transpose((0, 1, 3, 2))))
    # the mask is only needed for Local mixing
    if self.mixer == 'Local':
        attn += self.mask
    attn = nn.functional.softmax(attn, axis=-1)
    attn = self.attn_drop(attn)
    x = (attn.matmul(v)).transpose((0, 2, 1, 3)).reshape((0, N, C))
    x = self.proj(x)
    x = self.proj_drop(x)
    return x
3.2.2.1. Attention Mask
This is the self.mask in the code above.
In local mixing, each position only computes attention with the other positions inside the (7, 11) window centered on it. We therefore need to mark which positions take part in the attention computation and which do not, and that is exactly what the attention mask is for. The core code is as follows:
class Attention(nn.Layer):
    def __init__(self,
                 dim,
                 num_heads=8,
                 mixer='Global',
                 HW=None,
                 local_k=[7, 11],
                 qkv_bias=False,
                 qk_scale=None,
                 attn_drop=0.,
                 proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim**-0.5
        self.qkv = nn.Linear(dim, dim * 3, bias_attr=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)
        self.HW = HW
        if HW is not None:
            H = HW[0]
            W = HW[1]
            self.N = H * W
            self.C = dim
        if mixer == 'Local' and HW is not None:
            hk = local_k[0]
            wk = local_k[1]
            mask = paddle.ones(
                [H * W, H + hk - 1, W + wk - 1], dtype='float32')
            for h in range(0, H):
                for w in range(0, W):
                    mask[h * W + w, h:h + hk, w:w + wk] = 0.
            mask_paddle = mask[:, hk // 2:H + hk // 2,
                               wk // 2:W + wk // 2].flatten(1)
            mask_inf = paddle.full([H * W, H * W], '-inf', dtype='float32')
            mask = paddle.where(mask_paddle < 1, mask_paddle, mask_inf)
            self.mask = mask.unsqueeze([0, 1])  # (1, 1, H*W, H*W)
        self.mixer = mixer
To make this easier to visualize, take H=4, W=4, hk=3, wk=5 as an example. H and W are the height and width of the feature map; hk and wk are the height and width of the window.
mask = paddle.ones([H * W, H + hk - 1, W + wk - 1], dtype='float32')
for h in range(0, H):
    for w in range(0, W):
        mask[h * W + w, h:h + hk, w:w + wk] = 0.
The visualization looks like this:
mask_inf = paddle.full([H * W, H * W], '-inf', dtype='float32')
mask = paddle.where(mask_paddle < 1, mask_paddle, mask_inf)
Then every 1 in the figure above is replaced with negative infinity; the visualization looks like this:
Here -inf means the current position does not take part in the attention computation.
Still using H=4, W=4, hk=3, wk=5 as the example, the following visualizes how each feature point computes attention according to the mask above.
The figure shows, for each point on the feature map, which feature points it computes attention with (the red-bordered regions). Take the first feature point as an example: it only computes attention with feature points 1, 2, 3, 5, 6 and 7, which corresponds to the first row of the mask image; the sketch below verifies this.
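For concreteness, this minimal sketch builds the mask exactly as above with H=4, W=4, hk=3, wk=5 and prints which positions the first feature point attends to:

import paddle

H, W, hk, wk = 4, 4, 3, 5
mask = paddle.ones([H * W, H + hk - 1, W + wk - 1], dtype='float32')
for h in range(H):
    for w in range(W):
        mask[h * W + w, h:h + hk, w:w + wk] = 0.
mask_paddle = mask[:, hk // 2:H + hk // 2, wk // 2:W + wk // 2].flatten(1)
mask_inf = paddle.full([H * W, H * W], float('-inf'), dtype='float32')
mask = paddle.where(mask_paddle < 1, mask_paddle, mask_inf)

# 1-based positions the first feature point attends to: [1, 2, 3, 5, 6, 7]
print([i + 1 for i in range(H * W) if float(mask[0, i]) == 0.])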
3.2.2.2. Computing q, k and v
def forward(self, x):
    if self.HW is not None:
        N = self.N
        C = self.C
    else:
        _, N, C = x.shape
    qkv = self.qkv(x).reshape(
        (0, N, 3, self.num_heads, C // self.num_heads)).transpose(
            (2, 0, 3, 1, 4))
    q, k, v = qkv[0] * self.scale, qkv[1], qkv[2]
    attn = (q.matmul(k.transpose((0, 1, 3, 2))))
    # the mask is only needed for Local mixing
    if self.mixer == 'Local':
        attn += self.mask
    attn = nn.functional.softmax(attn, axis=-1)
    attn = self.attn_drop(attn)
    x = (attn.matmul(v)).transpose((0, 2, 1, 3)).reshape((0, N, C))
    x = self.proj(x)
    x = self.proj_drop(x)
    return x
Here, qkv = self.qkv(x).reshape((0, N, 3, self.num_heads, C // self.num_heads)).transpose((2, 0, 3, 1, 4)) computes q, k and v. The combined projection weight for w_q, w_k and w_v has shape (8, 24); the qkv output has shape (2, 200, 24), which becomes (2, 200, 3, 2, 4) after reshape and (3, 2, 2, 200, 4) after transpose.
In q, k, v = qkv[0] * self.scale, qkv[1], qkv[2], each of q, k and v has shape (2, 2, 200, 4). The first 2 is the batch size and the second 2 is the number of attention heads.
3.2.2.3. Computing the Attention Values
The attention values follow the standard scaled dot-product formula, Attention(Q, K, V) = softmax(QK^T / sqrt(d) + mask) V, which the code below implements:
attn = (q.matmul(k.transpose((0, 1, 3, 2))))  # (2, 2, 200, 200)
if self.mixer == 'Local':
    # add the predefined mask to the computed attn; the mask's shape is (1, 1, 200, 200)
    attn += self.mask
attn = nn.functional.softmax(attn, axis=-1)  # softmax; -inf becomes 0 after softmax (2, 2, 200, 200)
attn = self.attn_drop(attn)  # dropout; the rate here is 0, i.e. no dropout (2, 2, 200, 200)
x = (attn.matmul(v)).transpose((0, 2, 1, 3)).reshape((0, N, C))  # final attention output (2, 200, 8)
3.2.3. LocalMixing + Identity Mapping
x = x + self.drop_path(self.mixer(self.norm1(x)))
3.2.4. The Rest of the MixingBlock
x = x + self.drop_path(self.mlp(self.norm2(x))) # (2, 200, 8)
Here the mlp network consists of the following:
def forward(self, x):
    x = self.fc1(x)   # first fully connected layer
    x = self.act(x)   # activation layer
    x = self.drop(x)  # dropout layer
    x = self.fc2(x)   # second fully connected layer
    x = self.drop(x)  # dropout layer
    return x
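Putting Sections 3.2.1-3.2.4 together, a MixingBlock is a standard pre-norm Transformer block. A minimal sketch (drop_path omitted, the mlp inlined as a Sequential; it assumes the Attention class shown above is in scope, and the name MixingBlockSketch is illustrative):

import paddle
import paddle.nn as nn

class MixingBlockSketch(nn.Layer):
    def __init__(self, dim=8, num_heads=2, mlp_ratio=4.0,
                 mixer='Local', HW=(8, 25)):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = Attention(dim, num_heads=num_heads,
                                     mixer=mixer, HW=HW)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                        # (2, 200, 8)
        x = x + self.token_mixer(self.norm1(x))  # token mixing + residual
        x = x + self.mlp(self.norm2(x))          # channel MLP + residual
        return x

print(MixingBlockSketch()(paddle.randn([2, 200, 8])).shape)  # [2, 200, 8]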
3.3. The First PatchMerging
if self.patch_merging is not None:
    x = self.sub_sample1(
        x.transpose([0, 2, 1]).reshape(
            [0, self.embed_dim[0], self.HW[0], self.HW[1]]))
Input: (2, 8, 8, 25)
Processing: nn.Conv2D, nn.LayerNorm
Output: (2, 100, 8)
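A minimal sketch of this 'Conv' patch merging (the class name SubSampleSketch is illustrative): a stride-(2, 1) convolution halves the height while keeping the width, then the result is flattened back into a sequence and layer-normalized.

import paddle
import paddle.nn as nn

class SubSampleSketch(nn.Layer):
    def __init__(self, in_c=8, out_c=8):
        super().__init__()
        self.conv = nn.Conv2D(in_c, out_c, kernel_size=3,
                              stride=(2, 1), padding=1)
        self.norm = nn.LayerNorm(out_c)

    def forward(self, x):                      # (2, 8, 8, 25)
        x = self.conv(x)                       # (2, 8, 4, 25)
        x = x.flatten(2).transpose((0, 2, 1))  # (2, 100, 8)
        return self.norm(x)

print(SubSampleSketch()(paddle.randn([2, 8, 8, 25])).shape)  # [2, 100, 8]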
3.4. The Second MixingBlock
The code is exactly the same as the first MixingBlock, so it is not repeated here.
3.5. The Second PatchMerging
Input: (2, 8, 4, 25)
Processing: nn.Conv2D, nn.LayerNorm
Output: (2, 50, 8)
3.6. The Third MixingBlock
def forward(self, x):
    if self.HW is not None:
        N = self.N
        C = self.C
    else:
        _, N, C = x.shape
    qkv = self.qkv(x).reshape(
        (0, N, 3, self.num_heads, C // self.num_heads)).transpose(
            (2, 0, 3, 1, 4))
    q, k, v = qkv[0] * self.scale, qkv[1], qkv[2]
    attn = (q.matmul(k.transpose((0, 1, 3, 2))))
    # the mask is only needed for Local mixing
    if self.mixer == 'Local':
        attn += self.mask
    attn = nn.functional.softmax(attn, axis=-1)
    attn = self.attn_drop(attn)
    x = (attn.matmul(v)).transpose((0, 2, 1, 3)).reshape((0, N, C))
    x = self.proj(x)
    x = self.proj_drop(x)
    return x
The current MixingBlock uses Global Mixing. For Global Mixing, every point computes attention with all points, so there is no mask. The rest of the code is identical, so it is not repeated here.
3.7. PatchCombining
def forward(self, x):  # x: (2, 3, 32, 100)
    # forward_features returns (2, 50, 8)
    x = self.forward_features(x)
    if self.use_lenhead:
        len_x = self.len_conv(x.mean(1))
        len_x = self.dropout_len(self.hardswish_len(len_x))
    # the combining stage starts here
    if self.last_stage:
        if self.patch_merging is not None:
            h = self.HW[0] // 4
        else:
            h = self.HW[0]
        # first an average-pooling layer (2, 8, 1, 25)
        x = self.avg_pool(
            x.transpose([0, 2, 1]).reshape(
                [0, self.embed_dim[2], h, self.HW[1]]))
        # then a convolution layer (2, 192, 1, 25)
        x = self.last_conv(x)
        # then the Hardswish activation (2, 192, 1, 25)
        x = self.hardswish(x)
        # finally a dropout layer (2, 192, 1, 25)
        x = self.dropout(x)
    if self.use_lenhead:
        return x, len_x
    return x
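As a standalone illustration of this combining stage, here is a hedged sketch that reproduces the shape transitions above (the layer hyperparameters are assumptions based on the tiny config, not copied verbatim from the repo):

import paddle
import paddle.nn as nn

embed_dim, out_channels, HW = 8, 192, (8, 25)
avg_pool = nn.AdaptiveAvgPool2D([1, HW[1]])   # pool the height down to 1
last_conv = nn.Conv2D(embed_dim, out_channels, kernel_size=1, bias_attr=False)
hardswish = nn.Hardswish()
dropout = nn.Dropout(p=0.1)

x = paddle.randn([2, 50, embed_dim])          # sequence output of stage 3
h = HW[0] // 4                                # 2, after two patch mergings
x = x.transpose([0, 2, 1]).reshape([0, embed_dim, h, HW[1]])  # (2, 8, 2, 25)
x = avg_pool(x)                               # (2, 8, 1, 25)
x = dropout(hardswish(last_conv(x)))          # (2, 192, 1, 25)
print(x.shape)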
4. Neck
Input: (2, 192, 1, 25)
Processing: squeeze, transpose
Output: (2, 25, 192)
def forward(self, x):  # (2, 192, 1, 25)
    if self.encoder_type != 'svtr':
        # turn the feature map into a sequence
        x = self.encoder_reshape(x)
        if not self.only_reshape:
            x = self.encoder(x)
        return x
    else:
        x = self.encoder(x)
        x = self.encoder_reshape(x)
        return x
The core code of x = self.encoder_reshape(x) is:
def forward(self, x):  # (2, 192, 1, 25)
    B, C, H, W = x.shape
    assert H == 1
    # (2, 192, 25)
    x = x.squeeze(axis=2)
    # (2, 25, 192): 25 sequence steps, each of dimension 192
    x = x.transpose([0, 2, 1])  # (NTC)(batch, width, channels)
    return x
5. Head
Input: (2, 25, 192)
Processing: nn.Linear
Output: (2, 25, 37)
def forward(self, x, targets=None):  # (2, 25, 192)
    if self.mid_channels is None:
        # a single fully connected layer (2, 25, 37)
        predicts = self.fc(x)
    else:
        x = self.fc1(x)
        predicts = self.fc2(x)
    if self.return_feats:
        result = (x, predicts)
    else:
        result = predicts
    if not self.training:
        predicts = F.softmax(predicts, axis=2)
        result = predicts
    return result
Note that in the output shape (2, 25, 37), 25 is the number of output positions, i.e. at most 25 characters can be produced; 37 is the size of the character set, namely the number of characters in the configured dictionary character_dict_path: ../ppocr/utils/ic15_dict.txt plus one extra class for the CTC blank token (use_space_char is False here, so the extra class is the blank, not a space).
Regarding the limit on the number of characters: if an image is very long and contains too many characters, some may go unrecognized, as shown below:
Since each position can predict only one character, the color-marked positions above may be missed.
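To see how the 25 positions turn into a text string, here is a hedged sketch of greedy CTC decoding over the (2, 25, 37) output (assuming the blank is at index 0, as in PaddleOCR's CTCLabelDecode, and that ic15_dict.txt holds the 36 characters 0-9 and a-z):

import paddle

def ctc_greedy_decode(probs, charset):
    ids = probs.argmax(axis=-1).numpy()       # (batch, 25)
    texts = []
    for seq in ids:
        chars, prev = [], -1
        for i in seq:
            if i != prev and i != 0:          # collapse repeats, drop blanks
                chars.append(charset[i - 1])
            prev = i
        texts.append(''.join(chars))
    return texts

charset = list('0123456789abcdefghijklmnopqrstuvwxyz')  # 36 chars, as in ic15_dict.txt
probs = paddle.nn.functional.softmax(paddle.randn([2, 25, 37]), axis=-1)
print(ctc_greedy_decode(probs, charset))      # two random (meaningless) strings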
This article is just my personal study notes; if there are any mistakes, corrections from readers are very welcome.