[Paper Reading] Reading Notes on TextDragon: An End-to-End Framework for Arbitrary Shaped Text Spotting

一叶知秋Autumn

Article link: https://blog.csdn.net/SpicyCoder/article/details/104774201

The paper was published at ICCV 2019.
[Paper]: http://openaccess.thecvf.com/content_ICCV_2019/html/Feng_TextDragon_An_End-to-End_Framework_for_Arbitrary_Shaped_Text_Spotting_ICCV_2019_paper.html
[Code]: not found yet

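Abstract

The paper proposes RoISlide, a differentiable operation that links text detection and text recognition so that the whole model can be trained end to end. The method achieves state-of-the-art results on the two curved-text datasets CTW1500 and Total-Text, and competitive results on the regular-text dataset ICDAR2015.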

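Introduction

Most existing text spotting methods work in two separate steps: text detection followed by text recognition. This design has two drawbacks: it is time-consuming, and it ignores the connection between detection and recognition.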

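TextDragon is inspired by TextSnake [32], which detects text as a sequence of local units and can therefore handle text of arbitrary shape. Earlier arbitrary-shape spotting methods such as Mask TextSpotter [33], however, need character-level labels for training. Some datasets do not provide such labels, so producing them can require a large amount of manual annotation.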

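To detect text of arbitrary shape, the paper uses a series of local quadrilaterals to localize complex text.
As shown in Figure 2, RoISlide connects the detection and recognition modules. It extracts features from the feature map and rectifies arbitrarily shaped text regions, which reduces the variation in character size and orientation. The rectified text features are then fed to a CNN and Connectionist Temporal Classification (CTC) to produce the final result. The authors also describe TextDragon as the first end-to-end trainable model for arbitrarily shaped text, and it needs only word-level or line-level labels for training.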

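Three main contributions:
(1) The end-to-end TextDragon model
(2) The differentiable RoISlide operation, which unifies detection and recognition
(3) Training with only word-level or line-level annotations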

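Method

The pipeline works as follows. A backbone network extracts features from the image, and a text detector describes the text as a series of quadrilaterals located along the text centerline. RoISlide then extracts features from the feature map along the centerline, and its Local Transform Network converts the features inside each quadrilateral into rectified features. Finally, a CNN classifies the features of each quadrilateral and a CTC decoder decodes the final text sequence.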

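Text Detection

To handle text at different scales, the detector fuses feature maps from multiple layers and upsamples the fused feature map to 1/4 of the original image size.
The output module consists of Centerline Segmentation and Local Box Regression.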

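Centerline Segmentation: the goal is to find the centerline of the text. Pixels near the text centerline are predicted as 1 and all other pixels (the non-text region) as 0. Because centerline pixels are far fewer than non-text pixels, the paper follows [40] and uses **online hard example mining (OHEM)** to address the imbalance.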

Loss function: $L_{seg}=\frac{1}{|S|} \sum_{s \in S} L\left(p_{s}, p_{s}^{*}\right)=\frac{1}{|S|} \sum_{s \in S}\left(-p_{s}^{*} \log p_{s}-\left(1-p_{s}^{*}\right) \log \left(1-p_{s}\right)\right)$
Here $|S|$ is the number of elements selected by OHEM, $p_s$ is the network's binary classification output (the predicted probability) at that pixel, and $p_s^*$ is the ground truth, with $p_{s}^{*} \in\{0,1\}$.
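The loss above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `ohem_bce_loss`, the rule "keep all positives plus the hardest negatives", and the `neg_pos_ratio` parameter are assumptions made for illustration, since these notes do not say how the set $S$ is chosen.

```python
import numpy as np

def ohem_bce_loss(prob, target, neg_pos_ratio=3, eps=1e-7):
    """Binary cross-entropy averaged over an OHEM-selected pixel set S.

    prob:   predicted centerline probabilities p_s, any shape.
    target: ground-truth labels p_s^* in {0, 1}, same shape.
    neg_pos_ratio is a hypothetical setting, not a value from the paper.
    """
    prob = np.clip(np.asarray(prob, dtype=np.float64).ravel(), eps, 1.0 - eps)
    target = np.asarray(target, dtype=np.float64).ravel()
    # Per-pixel loss L(p_s, p_s^*).
    loss = -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))

    pos = target == 1
    neg_losses = loss[~pos]
    n_pos = int(pos.sum())
    # Assumed selection rule: all positives plus the highest-loss negatives.
    n_neg = min(neg_losses.size, max(1, neg_pos_ratio * n_pos))
    hard_neg = np.sort(neg_losses)[::-1][:n_neg]
    selected = np.concatenate([loss[pos], hard_neg])  # the set S
    return float(selected.mean())                     # (1 / |S|) * sum over S
```

With this selection rule, easy background pixels contribute nothing to the average, so the few centerline pixels are not drowned out.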

Local Box Regression: this step produces the bounding boxes. Each box is described by two parameters, its height and its angle, as shown in Figure 3.

Loss function:
$\left[\begin{array}{c}L_{B} \\ L_{\theta}\end{array}\right]=\frac{1}{|P|} \sum_{i \in P} \operatorname{Smooth}_{L_{1}}\left[\begin{array}{c}B_{i}-B_{i}^{*} \\ \theta_{i}-\theta_{i}^{*}\end{array}\right]$
$L_{reg}=L_{B}+\lambda_{\theta} L_{\theta}$
Here $P$ is the positive region (the text centerline region), $B_i$ and $\theta_i$ are the predicted box and angle, $B_i^*$ and $\theta_i^*$ are the ground truth, and $\lambda_{\theta}$ is a hyperparameter (set to 10 in the paper's experiments). The paper chooses the SmoothL1 loss [36] because it is robust to variation in object shape.
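As a rough illustration of the regression loss, the NumPy sketch below evaluates $L_{reg}$ over a centerline mask. The array layout, the function names, and summing the box term over its channels are assumptions; the Smooth L1 form used here is the common one with a transition at 1, which these notes do not spell out.

```python
import numpy as np

def smooth_l1(x):
    """Common Smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def regression_loss(box_pred, box_gt, theta_pred, theta_gt, center_mask,
                    lambda_theta=10.0):
    """L_reg = L_B + lambda_theta * L_theta over the positive region P.

    box_pred, box_gt:     (H, W, C) box parameters (layout is an assumption).
    theta_pred, theta_gt: (H, W) angles.
    center_mask:          (H, W) boolean mask of centerline pixels (the set P).
    """
    n = int(center_mask.sum())
    if n == 0:
        return 0.0  # no positive pixels, nothing to regress
    l_box = smooth_l1(box_pred - box_gt)[center_mask].sum() / n
    l_theta = smooth_l1(theta_pred - theta_gt)[center_mask].sum() / n
    return float(l_box + lambda_theta * l_theta)
```

Pixels outside the mask are ignored entirely, which matches the sum over $i \in P$ in the formula.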

RoISlide

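RoISlide transforms each local quadrilateral in order, and in this way indirectly converts the features of the whole text into axis-aligned features. It has two steps: 1. Arrange the quadrilaterals distributed along the text centerline. 2. Use a Local Transform Network (LTN) to convert, in a sliding manner, the feature map cropped from each quadrilateral into a rectified feature map. After these two steps the features become an ordered sequence of square feature maps, as shown in Figure 4.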
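The following NumPy sketch gives a feel for the two steps. It is not the paper's LTN: here each quadrilateral is mapped to a square by bilinear interpolation between its four corners, which stands in for whatever transform the LTN applies. The function names, the corner order (top-left, top-right, bottom-right, bottom-left), and the output size are all assumptions.

```python
import numpy as np

def bilinear_sample(feat, xs, ys):
    """Sample feat (H, W, C) at float coordinates xs, ys of equal shape."""
    h, w = feat.shape[:2]
    xs = np.clip(xs, 0, w - 1)
    ys = np.clip(ys, 0, h - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (xs - x0)[..., None]
    wy = (ys - y0)[..., None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bot = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def rectify_quad(feat, quad, size=8):
    """Warp one quadrilateral (4, 2) of (x, y) corners to a size x size patch."""
    tl, tr, br, bl = np.asarray(quad, dtype=np.float64)
    u = np.linspace(0.0, 1.0, size)[None, :, None]  # left to right
    v = np.linspace(0.0, 1.0, size)[:, None, None]  # top to bottom
    # Interpolate between the four corners to get one sample point per cell.
    pts = (1 - v) * ((1 - u) * tl + u * tr) + v * ((1 - u) * bl + u * br)
    return bilinear_sample(feat, pts[..., 0], pts[..., 1])

def roi_slide_sketch(feat, ordered_quads, size=8):
    """Rectify quads in centerline order and concatenate them along the width."""
    patches = [rectify_quad(feat, q, size) for q in ordered_quads]
    return np.concatenate(patches, axis=1)  # (size, size * n_quads, C)
```

Every operation here is a weighted sum of feature values, which is the property that lets gradients from the recognizer flow back into the shared features.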

Text Recognition

The paper uses a stack of convolutional layers in place of the LSTM of [45][46]. See Table 1 for the detailed configuration.
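Text recognition consists of two parts: a text classifier and a transcription layer. The classifier converts the square feature maps from the previous step into character probabilities, and the transcription layer maps those probabilities to English characters.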

The transcription layer uses a CTC decoder [9], whose purpose is to convert the probability distributions into a text sequence.
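To make the transcription step concrete, here is a minimal sketch of greedy (best-path) CTC decoding. These notes do not say which decoding strategy the paper uses, so greedy decoding is an assumption, as are the function name and the convention that class 0 is the blank.

```python
import numpy as np

def ctc_greedy_decode(probs, charset, blank=0):
    """Best-path CTC decoding.

    probs:   (T, K) per-step class probabilities, one row per rectified patch.
    charset: sequence where charset[k] is the character of class k
             (the entry at the blank index is never emitted).
    """
    best = np.argmax(probs, axis=1)  # most likely class at each step
    chars = []
    prev = blank
    for k in best:
        # Collapse repeated classes, then drop blanks.
        if k != blank and k != prev:
            chars.append(charset[k])
        prev = k
    return "".join(chars)
```

For example, the per-step labels `a a - a b b` (with `-` as the blank) decode to `aab`: the blank between the two runs of `a` is what keeps them from being merged.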

The recognition loss is: $L_{rec}=-\frac{1}{M} \sum_{m=1}^{M} \log p(y | X)$
The overall loss for end-to-end training is then: $L=L_{seg}+\lambda_{reg} L_{reg}+\lambda_{rec} L_{rec}$, where $\lambda_{rec}$ and $\lambda_{reg}$ are hyperparameters.

Inference

The inference steps are shown in Figure 5:

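Grouping: boxes are grouped according to their geometric relationships.
Sorting: the first step is to determine whether the boxes in a group run horizontally or vertically overall.
Sampling: to generate the boundary, the ordered boxes are sampled uniformly to form the vertices of a polygon, and the vertices are then connected in order to produce the text boundary.
Recognition: run RoISlide and CTC.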
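The sampling step can be sketched as follows, assuming grouping and sorting have already produced an ordered list of quadrilaterals. The function name `boxes_to_polygon`, the corner order (top-left, top-right, bottom-right, bottom-left), the use of edge midpoints as vertices, and the default of 7 samples are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def boxes_to_polygon(ordered_quads, n_samples=7):
    """Build a closed text boundary from boxes sorted along the text line.

    ordered_quads: (N, 4, 2) corners as (x, y), with N >= 1.
    Returns a (2 * (k + 2), 2) array of polygon vertices, k = min(n_samples, N).
    """
    quads = np.asarray(ordered_quads, dtype=np.float64)
    k = min(n_samples, len(quads))
    # Uniformly spaced box indices along the ordered sequence.
    idx = np.round(np.linspace(0, len(quads) - 1, k)).astype(int)
    picked = quads[idx]
    top_mid = (picked[:, 0] + picked[:, 1]) / 2.0  # midpoints of top edges
    bot_mid = (picked[:, 3] + picked[:, 2]) / 2.0  # midpoints of bottom edges
    # Add the outer corners so the polygon covers both ends of the text.
    top = np.vstack([picked[0, 0], top_mid, picked[-1, 1]])
    bottom = np.vstack([picked[0, 3], bot_mid, picked[-1, 2]])
    # Walk along the top side, then back along the bottom side.
    return np.concatenate([top, bottom[::-1]], axis=0)
```

Connecting the returned vertices in order, and closing the last vertex back to the first, gives the text boundary.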

Experiments

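End-to-end vs. non-end-to-end: Figure 6 shows that end-to-end training improves the detection rate for inconspicuous text.
RoISlide vs. RoIRotate: Tables 2 and 3 and Figure 6(c, d) show that RoIRotate [29] is not suitable for curved text, while RoISlide and RoIRotate perform similarly on regular text.
Spotting with vs. without LSTM: the CNN-based text recognizer is 4 times faster than the LSTM-based one.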

References

A selection of the original paper's references that are cited in this post.

[32] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. Textsnake: A flexible representation for detecting text of arbitrary shapes. In European Conference on Computer Vision (ECCV), pages 19–35. Springer, 2018.

[31] Yuliang Liu, Lianwen Jin, Shuaitao Zhang, and Sheng Zhang. Detecting curve text in the wild: New dataset and new solution. In arXiv preprint arXiv:1712.02170, 2017.

[46] Xiaobing Wang, Yingying Jiang, Zhenbo Luo, Cheng-Lin Liu, Hyunsoo Choi, and Sungjin Kim. Arbitrary shape scene text detection with adaptive text region representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[39] Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. Robust scene text recognition with automatic rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4168–4176, 2016.

[28] Wei Liu, Chaofeng Chen, Kwan-Yee K Wong, Zhizhong Su, and Junyu Han. Star-net: A spatial attention residue network for scene text recognition. In BMVC, volume 2, page 7, 2016.

[5] Zhanzhan Cheng, Xuyang Liu, Fan Bai, Yi Niu, Shiliang Pu, and Shuigeng Zhou. Arbitrarily-oriented text recognition. In arXiv preprint arXiv:1711.04226, 2017.

[25] Hui Li, Peng Wang, and Chunhua Shen. Towards end-to-end text spotting with convolutional recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5238–5246, 2017.

[29] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. Fots: Fast oriented text spotting with a unified network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5676–5685, 2018.

[35] Yash Patel, Michal Bušta, and Jiri Matas. E2e-mlt - an unconstrained end-to-end method for multi-language scene text. In arXiv preprint arXiv:1801.09919, 2018.

[40] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 761–769, 2016.

[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015.

[45] Tao Wang, David J Wu, Adam Coates, and Andrew Y Ng. End-to-end text recognition with convolutional neural networks. In Proceedings of the International Conference on Pattern Recognition (ICPR), pages 3304–3308. IEEE, 2012.

[9] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), pages 369–376. ACM, 2006.

[11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–2988. IEEE, 2017.

[33] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In European Conference on Computer Vision (ECCV), September 2018.
