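Introduction
OCR systems generally follow one of two approaches: train text detection and text recognition separately as two models, or use a single unified model. The former may be more accurate and lets each stage be optimized on its own; the latter is faster. I recently read the ABCNetv2 paper and its source code carefully, and these are my notes (PDF | Code). One complaint: because the source is wrapped inside the AdelaiDet toolbox, it is quite hard to read -_-! Still, it is great that the authors open-sourced this work at all, so I can hardly ask for more. Since the ABCNet environment is hard to set up, I put together a Docker image; if you run into trouble with the setup, give it a try: docker abcnetv2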
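Overall structure of ABCNetv2
Detailed structure and the corresponding source code
The whole structure can be divided into six parts: Backbone, BiFPN, CoordConv, BezierAlign, CRNN, and attention-based decoding. Each part is explained below alongside the corresponding source code (using training on the ReCTS dataset as the example).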
Backbone
configs/BAText/ReCTS/v2_chn_attn_R_50.yaml shows that the backbone is build_fcos_resnet_bifpn_backbone and that its main structure is ResNet50:

MODEL:
  WEIGHTS: "https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/MSRA/R-50.pkl"
  BACKBONE:
    NAME: "build_fcos_resnet_bifpn_backbone"
  BiFPN:
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    OUT_CHANNELS: 256
    NUM_REPEATS: 2
    NORM: "SyncBN"
  RESNETS:
    DEPTH: 50
  BATEXT:
    RECOGNIZER: "attn"
    USE_COORDCONV: True
    USE_AET: True
    VOC_SIZE: 5462
    CUSTOM_DICT: "chn_cls_list"
FCOS is another work integrated in the toolbox, published at ICCV 2019. As its title, Fully Convolutional One-Stage Object Detection, suggests, it mainly proposes a framework for fully convolutional one-stage object detection. The function build_fcos_resnet_bifpn_backbone lives in adet/modeling/backbone/bifpn.py; its main code is shown below (comments added at the key points):

@BACKBONE_REGISTRY.register()
def build_fcos_resnet_bifpn_backbone(cfg, input_shape: ShapeSpec):
    """
    Args:
        cfg: a detectron2 CfgNode
    Returns:
        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
    """
    # Choose the bottom-up network: MobileNetV2 or ResNet (ResNet-50 in this config)
    if cfg.MODEL.MOBILENET:
        bottom_up = build_mnv2_backbone(cfg, input_shape)
    else:
        bottom_up = build_resnet_backbone(cfg, input_shape)
    # BiFPN settings come from the MODEL.BiFPN section of the YAML config
    in_features = cfg.MODEL.BiFPN.IN_FEATURES
    out_channels = cfg.MODEL.BiFPN.OUT_CHANNELS
    num_repeats = cfg.MODEL.BiFPN.NUM_REPEATS
    top_levels = cfg.MODEL.FCOS.TOP_LEVELS
    # Stack the BiFPN on top of the bottom-up network
    backbone = BiFPN(
        bottom_up=bottom_up,
        in_features=in_features,
        out_channels=out_channels,
        num_top_levels=top_levels,
        num_repeats=num_repeats,
        norm=cfg.MODEL.BiFPN.NORM
    )
    return backbone
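To make the link between the NAME string in the YAML file and the decorated builder function more concrete, here is a minimal sketch of the registry pattern. The Registry class, TOY_BACKBONE_REGISTRY, build_toy_backbone and the dict-based cfg are all hypothetical stand-ins written for illustration; detectron2's real registry and CfgNode are richer than this.

```python
class Registry:
    """Toy name -> builder registry (a simplification, not detectron2's class)."""

    def __init__(self, name):
        self._name = name
        self._builders = {}

    def register(self):
        # Used as a decorator: @REGISTRY.register()
        def decorator(func):
            if func.__name__ in self._builders:
                raise KeyError(f"{func.__name__} already registered in {self._name}")
            self._builders[func.__name__] = func
            return func
        return decorator

    def get(self, name):
        if name not in self._builders:
            raise KeyError(f"No builder named {name} in {self._name}")
        return self._builders[name]


TOY_BACKBONE_REGISTRY = Registry("BACKBONE")


@TOY_BACKBONE_REGISTRY.register()
def build_toy_backbone(cfg):
    # Hypothetical builder: only echoes the setting it would use.
    return {"type": "toy", "depth": cfg["RESNETS"]["DEPTH"]}


def build_backbone(cfg):
    # Look up the builder by the NAME string from the config, then call it.
    builder = TOY_BACKBONE_REGISTRY.get(cfg["BACKBONE"]["NAME"])
    return builder(cfg)


cfg = {"BACKBONE": {"NAME": "build_toy_backbone"}, "RESNETS": {"DEPTH": 50}}
backbone = build_backbone(cfg)
print(backbone)
```

The point of the pattern is that switching backbones only requires changing a string in the config, not the model-building code.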
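BiFPN
In the toolbox this part comes with the FCOS-style backbone (the BiFPN design itself was introduced in EfficientDet). It cascades two FPN-style blocks (NUM_REPEATS: 2 in the config) to extract multi-scale features. BiFPN corresponds to the part outlined in red in the ABCNetv2 architecture diagram. The main code is at adet/modeling/backbone/bifpn.py#L280:

class BiFPN(Backbone):
    """
    This module implements Feature Pyramid Network.
    It creates pyramid features built on top of some input feature maps.
    """

    def __init__(
        self, bottom_up, in_features, out_channels, num_top_levels, num_repeats, norm=""
    ):
        super(BiFPN, self).__init__()
        assert isinstance(bottom_up, Backbone)
        # Wrap the bottom-up network together with the extra top levels
        self.bottom_up = BackboneWithTopLevels(
            bottom_up, out_channels,
            num_top_levels, norm
        )
        # NOTE: abridged excerpt. bottom_up_output_shapes, self.in_features,
        # self._out_features and self._out_feature_channels are set up in lines
        # of the original file that are not reproduced here.
        self.repeated_bifpn = nn.ModuleList()
        for i in range(num_repeats):
            if i == 0:
                # The first block consumes the raw bottom-up channel counts
                in_channels_list = [
                    bottom_up_output_shapes[name].channels for name in in_features
                ]
            else:
                # Later blocks consume the outputs of the previous block
                in_channels_list = [
                    self._out_feature_channels[name] for name in self._out_features
                ]
            self.repeated_bifpn.append(SingleBiFPN(
                in_channels_list, out_channels, norm
            ))

    def forward(self, x):
        bottom_up_features = self.bottom_up(x)
        feats = [bottom_up_features[f] for f in self.in_features]
        # Run the features through each BiFPN block in turn
        for bifpn in self.repeated_bifpn:
            feats = bifpn(feats)
        return dict(zip(self._out_features, feats))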
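The channel bookkeeping in the constructor loop can be sketched in plain Python. This is a simplified illustration under stated assumptions: bifpn_in_channels is a hypothetical helper, the extra top levels added by BackboneWithTopLevels are ignored, and the res2 to res5 channel counts are the standard ResNet-50 stage widths.

```python
def bifpn_in_channels(bottom_up_channels, in_features, out_channels, num_repeats):
    """Return the input channel list seen by each repeated BiFPN block."""
    per_repeat = []
    for i in range(num_repeats):
        if i == 0:
            # First block: raw channel counts of the bottom-up features
            chans = [bottom_up_channels[name] for name in in_features]
        else:
            # Later blocks: every level already has out_channels channels
            chans = [out_channels] * len(in_features)
        per_repeat.append(chans)
    return per_repeat


# Standard ResNet-50 stage widths (assumption: top levels are left out here)
resnet50_channels = {"res2": 256, "res3": 512, "res4": 1024, "res5": 2048}
channels = bifpn_in_channels(
    resnet50_channels, ["res2", "res3", "res4", "res5"], out_channels=256, num_repeats=2
)
print(channels)
```

Only the first block has to deal with mismatched channel counts; from the second block on, every level is already at OUT_CHANNELS.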
CoordConv
This part mainly follows the paper An intriguing failing of convolutional neural networks and the CoordConv solution. Its main structure is shown in the figure below. The corresponding source code is at adet/modeling/roi_heads/text_head.py#L73, and the class that implements it is MaskHead:

class MaskHead(nn.Module):
    def __init__(self, cfg):
        super(MaskHead, self).__init__()
        conv_dim = cfg.MODEL.BATEXT.CONV_DIM
        conv_block = conv_with_kaiming_uniform(
            norm="BN", activation=True)
        convs = []
        # 258 input channels = 256 feature channels + 2 coordinate channels (x, y)
        convs.append(conv_block(258, conv_dim, 3, 1))
        for i in range(3):
            convs.append(conv_block(
                conv_dim, conv_dim, 3, 1))
        self.mask_convs = nn.Sequential(*convs)

    def forward(self, features):
        # Normalized coordinates in [-1, 1] along width and height
        x_range = torch.linspace(-1, 1, features.shape[-1], device=features.device)
        y_range = torch.linspace(-1, 1, features.shape[-2], device=features.device)
        y, x = torch.meshgrid(y_range, x_range)
        # Broadcast the coordinate maps over the batch: N x 1 x H x W each
        y = y.expand([features.shape[0], 1, -1, -1])
        x = x.expand([features.shape[0], 1, -1, -1])
        coord_feat = torch.cat([x, y], 1)
        # Append the two coordinate channels to the features
        ins_features = torch.cat([features, coord_feat], dim=1)
        mask_features = self.mask_convs(ins_features)
        return mask_features

Opinions are divided on the problem that the CoordConv paper tries to solve. However, since ABCNetv2 adopts it and reports the corresponding experimental results, I will give it the benefit of the doubt for now.
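As a small illustration of what the two extra coordinate channels contain, the following sketch builds them with plain Python lists instead of tensors. The helpers linspace and coord_channels are hypothetical names written for this example; they mimic what torch.linspace and torch.meshgrid produce in MaskHead.forward for a single feature map.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop, both ends included."""
    if num == 1:
        return [start]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def coord_channels(height, width):
    """Return (x_map, y_map), each a height x width grid with values in [-1, 1]."""
    xs = linspace(-1.0, 1.0, width)
    ys = linspace(-1.0, 1.0, height)
    # x varies along columns, y varies along rows
    x_map = [[xs[j] for j in range(width)] for _ in range(height)]
    y_map = [[ys[i] for _ in range(width)] for i in range(height)]
    return x_map, y_map


x_map, y_map = coord_channels(3, 5)
for row in x_map:
    print(row)
for row in y_map:
    print(row)

# 256 feature channels plus the 2 coordinate channels give the 258 inputs
# expected by the first convolution in MaskHead.
total_channels = 256 + 2
```

Every pixel therefore carries its own normalized position, which the following convolutions can use directly.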
BezierAlign
This part pools the RoI region based on the Bezier control points; the values at points lying between the 8 control points are obtained through bilinear interpolation. The comparison of the different Align variants in the paper makes it clear how this works, as shown in the figure below. The source code for this part is implemented in C++, and PyTorch calls the compiled C++ extension from within the existing program. It is located at adet/layers/csrc/BezierAlign/BezierAlign_cpu.cpp#L215 (comments added at the key points):

    // Number of sampling points per output bin along each axis
    int roi_bin_grid_h = (sampling_ratio > 0)
        ? sampling_ratio
        : ceil(roi_height / pooled_height);
    int roi_bin_grid_w =
        (sampling_ratio > 0) ? sampling_ratio : ceil(roi_width / pooled_width);
    // Each output bin is the average over its sampling points
    const T count = std::max(roi_bin_grid_h * roi_bin_grid_w, 1);
    // Precompute the bilinear weights and positions once per RoI;
    // they are shared by all channels. p0..p7 are the 8 control points.
    std::vector<PreCalc<T>> pre_calc(
        roi_bin_grid_h * roi_bin_grid_w * pooled_width * pooled_height);
    pre_calc_for_bilinear_interpolate(
        height,
        width,
        pooled_height,
        pooled_width,
        roi_bin_grid_h,
        roi_bin_grid_w,
        p0_x, p0_y, p1_x, p1_y,
        p2_x, p2_y, p3_x, p3_y,
        p4_x, p4_y, p5_x, p5_y,
        p6_x, p6_y, p7_x, p7_y,
        bin_size_h,
        bin_size_w,
        roi_bin_grid_h,
        roi_bin_grid_w,
        pre_calc);
    for (int c = 0; c < channels; c++) {
      int index_n_c = index_n + c * pooled_width * pooled_height;
      const T* offset_input =
          input + (roi_batch_ind * channels + c) * height * width;
      int pre_calc_index = 0;
      for (int ph = 0; ph < pooled_height; ph++) {
        for (int pw = 0; pw < pooled_width; pw++) {
          int index = index_n_c + ph * pooled_width + pw;
          T output_val = 0.;
          // Accumulate the bilinear samples that fall into this bin
          for (int iy = 0; iy < roi_bin_grid_h; iy++) {
            for (int ix = 0; ix < roi_bin_grid_w; ix++) {
              PreCalc<T> pc = pre_calc[pre_calc_index];
              output_val += pc.w1 * offset_input[pc.pos1] +
                  pc.w2 * offset_input[pc.pos2] +
                  pc.w3 * offset_input[pc.pos3] + pc.w4 * offset_input[pc.pos4];
              pre_calc_index += 1;
            }
          }
          // Average over the sampling points
          output_val /= count;
          output[index] = output_val;
        }
      }
    }
  }  // closes an enclosing scope opened before this excerpt
}  // closes an enclosing scope opened before this excerpt
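The idea behind BezierAlign can be sketched in pure Python. This is a simplified illustration, not the C++ implementation above: it assumes cubic Bezier curves with four control points for the top edge and four for the bottom edge (both given left to right), takes a single sample per output bin instead of averaging roi_bin_grid_h x roi_bin_grid_w samples, and the function names are hypothetical.

```python
import math


def bezier_point(ctrl, t):
    """Evaluate a cubic Bezier curve with 4 control points at t in [0, 1]."""
    weights = [(1 - t) ** 3, 3 * t * (1 - t) ** 2, 3 * t ** 2 * (1 - t), t ** 3]
    x = sum(w * p[0] for w, p in zip(weights, ctrl))
    y = sum(w * p[1] for w, p in zip(weights, ctrl))
    return x, y


def bilinear(feature, x, y):
    """Bilinearly interpolate a 2D list feature[row][col] at position (x, y)."""
    height, width = len(feature), len(feature[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    upper = (1 - wx) * feature[y0][x0] + wx * feature[y0][x1]
    lower = (1 - wx) * feature[y1][x0] + wx * feature[y1][x1]
    return (1 - wy) * upper + wy * lower


def bezier_align(feature, top_ctrl, bottom_ctrl, out_h, out_w):
    """Pool the region between two Bezier curves into an out_h x out_w grid."""
    output = []
    for i in range(out_h):
        s = (i + 0.5) / out_h  # vertical position between the two curves
        row = []
        for j in range(out_w):
            t = (j + 0.5) / out_w  # curve parameter along the text line
            tx, ty = bezier_point(top_ctrl, t)
            bx, by = bezier_point(bottom_ctrl, t)
            # Sampling point: linear blend of the top and bottom curve points
            px = (1 - s) * tx + s * bx
            py = (1 - s) * ty + s * by
            row.append(bilinear(feature, px, py))
        output.append(row)
    return output


# Toy feature map whose value encodes its own position: f[y][x] = x + 10 * y
feature = [[x + 10 * y for x in range(8)] for y in range(4)]
top = [(0, 0), (2, 0), (4, 0), (6, 0)]
bottom = [(0, 2), (2, 2), (4, 2), (6, 2)]
pooled = bezier_align(feature, top, bottom, out_h=2, out_w=3)
print(pooled)
```

With straight, evenly spaced control points the sketch reduces to ordinary RoI sampling; bending the control points makes the sampling grid follow curved text.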
CRNN
For this part the paper builds a small network with only a few layers. As the figure below shows, it mainly consists of 6 convolutional layers, one BLSTM, and an attention-based decoder. The code is in adet/modeling/roi_heads/attn_predictor.py; see the CRNN source below. Note that, unlike the 6 convolutional layers mentioned in the paper, the source code uses only 2. I do not know whether this is a slip on the authors' part or a deliberate choice.

class CRNN(nn.Module):
    def __init__(self, cfg, in_channels):
        super(CRNN, self).__init__()
        conv_func = conv_with_kaiming_uniform(norm="GN", activation=True)
        convs = []
        # Only 2 conv layers; stride (2, 1) halves the height and keeps the width
        for i in range(2):
            convs.append(conv_func(in_channels, in_channels, 3, stride=(2, 1)))
        self.convs = nn.Sequential(*convs)
        self.rnn = BidirectionalLSTM(in_channels, in_channels, in_channels)

    def forward(self, x):
        # x: N x C x H x W
        x = self.convs(x)
        x = x.mean(dim=2)  # average over height: N x C x W
        x = x.permute(2, 0, 1)  # sequence-first layout: W x N x C
        x = self.rnn(x)
        return x
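To follow the tensor shapes through CRNN.forward, here is a small shape trace in plain Python. It is a sketch under assumptions: the 3x3 convolutions are assumed to use padding 1, the input size of 8 x 32 is a hypothetical example, and conv_out and crnn_shapes are names made up for this illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def crnn_shapes(n, c, h, w, num_convs=2, padding=1):
    """Trace shapes through the conv stack, the height mean and the permute."""
    for _ in range(num_convs):
        h = conv_out(h, kernel=3, stride=2, padding=padding)  # height is halved
        w = conv_out(w, kernel=3, stride=1, padding=padding)  # width is kept
    after_convs = (n, c, h, w)
    after_mean = (n, c, w)   # x.mean(dim=2) removes the height axis
    sequence = (w, n, c)     # x.permute(2, 0, 1): time steps first
    return after_convs, after_mean, sequence


shapes = crnn_shapes(n=4, c=256, h=8, w=32)
print(shapes)
```

Each column of the pooled feature map becomes one time step of the sequence fed to the BLSTM.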
Attention-based decoder
Version 1 of the paper uses a decoder based on CTC loss, while version 2 uses an attention-based decoder. The authors state that they found the attention-based approach to give better results. In my experience so far, though, both kinds of decoder have their merits, and the CTC-based decoder is still the one most widely used in industry. The code is in the Attention class at adet/modeling/roi_heads/attn_predictor.py#L49; the main code is as follows:

class Attention(nn.Module):
    def __init__(self, cfg, in_channels):
        super(Attention, self).__init__()
        self.hidden_size = in_channels
        self.output_size = cfg.MODEL.BATEXT.VOC_SIZE + 1
        self.dropout_p = 0.1
        self.max_len = cfg.MODEL.BATEXT.NUM_CHARS
        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)
        self.vat = nn.Linear(self.hidden_size, 1)

    def forward(self, input, hidden, encoder_outputs):
        '''
        hidden: 1 x n x self.hidden_size
        encoder_outputs: time_step x n x self.hidden_size (T,N,C)
        '''
        # Embed the previously predicted character
        embedded = self.embedding(input)
        embedded = self.dropout(embedded)
        batch_size = encoder_outputs.shape[1]
        # Additive attention score: vat(tanh(hidden + encoder_output))
        alpha = hidden + encoder_outputs  # broadcast to T x N x C
        alpha = alpha.view(-1, alpha.shape[-1])
        attn_weights = self.vat(torch.tanh(alpha))
        attn_weights = attn_weights.view(-1, 1, batch_size).permute((2, 1, 0))  # N x 1 x T
        attn_weights = F.softmax(attn_weights, dim=2)  # normalize over time steps
        # Weighted sum of the encoder outputs: N x 1 x C
        attn_applied = torch.matmul(attn_weights,
                                    encoder_outputs.permute((1, 0, 2)))
        if embedded.dim() == 1:
            embedded = embedded.unsqueeze(0)
        # Combine the embedding with the context vector and feed the GRU
        output = torch.cat((embedded, attn_applied.squeeze(1)), 1)
        output = self.attn_combine(output).unsqueeze(0)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights
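The attention weighting inside forward can be sketched for a single sample in plain Python. This is a minimal illustration, assuming the scoring vector v stands in for the weights of self.vat without its bias (a constant bias does not change the softmax); attention_step and the toy numbers are hypothetical.

```python
import math


def attention_step(hidden, encoder_outputs, v):
    """Additive attention for one sample.

    hidden: list of C floats (decoder state)
    encoder_outputs: list of T lists with C floats each
    v: list of C floats (scoring vector)
    """
    # score_t = v . tanh(hidden + encoder_output_t)
    scores = [
        sum(vi * math.tanh(h + e) for vi, h, e in zip(v, hidden, enc))
        for enc in encoder_outputs
    ]
    # Softmax over the time steps (shifted by the max for numerical stability)
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the encoder outputs
    context = [
        sum(w * enc[k] for w, enc in zip(weights, encoder_outputs))
        for k in range(len(hidden))
    ]
    return weights, context


hidden = [0.0, 0.0]
encoder_outputs = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
v = [1.0, 1.0]
weights, context = attention_step(hidden, encoder_outputs, v)
print(weights, context)
```

In the real module the context vector is then concatenated with the character embedding and passed through attn_combine and the GRU to predict the next character.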
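Model quantization
Summary
I chose to read this paper mainly because most current OCR systems are assembled from separate, non-end-to-end stages, which makes the overall run time long. End-to-end methods are attractive for two reasons: the overall structure is simple and easy to understand, and inference is usually fast. After reading the paper and running the code, these are my main observations about ABCNetv2:
Using the Chinese pretrained model released in the official repo, I tested a few Chinese images that I collected myself. Text detection was acceptable, but the recognition results were unsatisfactory. The likely reason is the small amount of training data (kept small for a fairer comparison with other papers). Example results are shown below:
Extraction of long text is poor, as can be seen in issue #443 of the official repository. An example image is shown below:
Because the source code is built on the AdelaiDet toolbox and Detectron2, the code feels messy, and setting up a working environment is troublesome. One reason is that BezierAlign is implemented in C++ and has to be compiled by the user, which makes the project even less friendly. The Docker image provided in the repository does not have a properly configured environment either. Taken together, the barrier to entry is high.
Even if you manage to train a good model, deployment is another problem. Because BezierAlign is implemented in C++, converting the whole model to ONNX is not straightforward.
When I have time, I would like to extract the relevant ABCNetv2 source code and reimplement it entirely in Python and PyTorch, which would make it much easier to put into production.
Related resources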