Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes (Chinese-English Translation)

学不来

Category: Oblique Text Detection; Tags: deep learning, neural networks, computer vision

Article link: https://blog.csdn.net/weixin_43882068/article/details/124544180

Abstract. Recently, models based on deep neural networks have dominated the fields of scene text detection and recognition. In this paper, we investigate the problem of scene text spotting, which aims at simultaneous text detection and recognition in natural images. An end-to-end trainable neural network model for scene text spotting is proposed. The proposed model, named Mask TextSpotter, is inspired by the newly published work Mask R-CNN. Different from previous methods that also accomplish text spotting with end-to-end trainable deep neural networks, Mask TextSpotter takes advantage of a simple and smooth end-to-end learning procedure, in which precise text detection and recognition are acquired via semantic segmentation. Moreover, it is superior to previous methods in handling text instances of irregular shapes, for example, curved text. Experiments on ICDAR2013, ICDAR2015 and Total-Text demonstrate that the proposed method achieves state-of-the-art results in both scene text detection and end-to-end text recognition tasks.

1 Introduction

In recent years, scene text detection and recognition have attracted growing research interest from the computer vision community, especially after the revival of neural networks and the growth of image datasets. Scene text detection and recognition provide an automatic, rapid approach to access the textual information embodied in natural scenes, benefiting a variety of real-world applications, such as geo-location [58], instant translation, and assistance for the blind. Scene text spotting, which aims at concurrently localizing and recognizing text from natural scenes, has been previously studied in numerous works [49,21]. However, in most works, except [27] and [3], text detection and subsequent recognition are handled separately. Text regions are first hunted from the original image by a trained detector and then fed into a recognition module. This procedure seems simple and natural, but might lead to sub-optimal performance for both detection and recognition, since these two tasks are highly correlated and complementary. On one hand, the quality of detections largely determines the accuracy of recognition; on the other hand, the results of recognition can provide feedback to help reject false positives in the detection phase.

Recently, two methods [27, 3] that devise end-to-end trainable frameworks for scene text spotting have been proposed. Benefiting from the complementarity between detection and recognition, these unified models significantly outperform previous competitors. However, there are two major drawbacks in [27] and [3]. First, neither of them can be completely trained in an end-to-end manner. [27] applied a curriculum learning paradigm [1] in the training period, where the sub-network for text recognition is locked at the early iterations and the training data for each period is carefully selected. Busta et al. [3] at first pre-train the networks for detection and recognition separately and then jointly train them until convergence. There are mainly two reasons that stop [27] and [3] from training the models in a smooth, end-to-end fashion. One is that the text recognition part requires accurate locations for training, while the locations in the early iterations are usually inaccurate. The other is that the adopted LSTM [17] or CTC loss [11] is more difficult to optimize than general CNNs. The second limitation of [27] and [3] is that these methods only focus on reading horizontal or oriented text. However, the shapes of text instances in real-world scenarios may vary significantly, from horizontal or oriented to curved forms.

In this paper, we propose a text spotter named Mask TextSpotter, which can detect and recognize text instances of arbitrary shapes. Here, arbitrary shapes means the various forms of text instances found in the real world. Inspired by Mask R-CNN [13], which can generate shape masks of objects, we detect text by segmenting the instance text regions. Thus our detector is able to detect text of arbitrary shapes. Besides, different from the previous sequence-based recognition methods [45, 44, 26], which are designed for 1-D sequences, we recognize text via semantic segmentation in 2-D space, to solve the issues in reading irregular text instances. Another advantage is that it does not require accurate locations for recognition. Therefore, the detection task and recognition task can be completely trained end-to-end, and benefit from feature sharing and joint optimization.
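To make "recognizing text via semantic segmentation in 2-D space" more concrete, here is a minimal sketch of how per-pixel character score maps could be decoded into a string. It is an illustration under stated assumptions, not the authors' implementation: the function name `decode_characters`, the background threshold of 0.5, the 4-connected flood fill, and the left-to-right ordering by region center are all hypothetical choices made for this example.

```python
from collections import deque


def decode_characters(background, char_maps, alphabet, threshold=0.5):
    """Decode a string from per-pixel score maps (illustrative sketch).

    background: 2-D list of background probabilities in [0, 1].
    char_maps:  list of 2-D lists, one score map per symbol in `alphabet`.
    threshold:  assumed cutoff; pixels below it count as character pixels.
    """
    height, width = len(background), len(background[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or background[y][x] >= threshold:
                continue
            # Flood-fill one 4-connected foreground region (a candidate character).
            queue, pixels = deque([(y, x)]), []
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and background[ny][nx] < threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(pixels)

    decoded = []
    for pixels in regions:
        # Every pixel in the region votes with its score; highest mean wins.
        means = [sum(m[y][x] for y, x in pixels) / len(pixels) for m in char_maps]
        best = max(range(len(alphabet)), key=means.__getitem__)
        center_x = sum(x for _, x in pixels) / len(pixels)
        decoded.append((center_x, alphabet[best]))
    # Assumed reading order: left to right by region center.
    return "".join(symbol for _, symbol in sorted(decoded))
```

Because each character is classified from its own 2-D region rather than from a 1-D feature sequence, a decoder of this kind does not depend on the text lying along a straight line, although the simple left-to-right ordering used here would need to be replaced for strongly curved or vertical text.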

Fig. 1: Illustrations of different text spotting methods. Left: horizontal text spotting methods [30, 27]. Middle: oriented text spotting methods [3]. Right: our proposed method. Green bounding boxes show detection results; red text on a green background shows recognition results.

We validate the effectiveness of our model on datasets that include horizontal, oriented and curved text. The results demonstrate the advantages of the proposed algorithm in both text detection and end-to-end text recognition tasks. Specifically, on ICDAR2015, evaluated at a single scale, our method achieves an F-Measure of 0.86 on the detection task and outperforms the previous top performers by 13.2% to 25.3% on the end-to-end recognition task.
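For readers unfamiliar with the metric, the F-Measure is the harmonic mean of precision and recall. The sketch below shows only that arithmetic; the function name `f_measure` and the counts in the usage line are hypothetical, and the benchmark's own rules for deciding when a detection matches a ground-truth box are not modeled here.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, from raw match counts."""
    detected = true_positives + false_positives
    annotated = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / annotated if annotated else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not taken from the paper):
# 3 correct detections, 1 spurious detection, 1 missed text instance.
score = f_measure(3, 1, 1)  # precision = recall = 0.75, so the F-Measure is 0.75
```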

The main contributions of this paper are four-fold. (1) We propose an end-to-end trainable model for text spotting, which enjoys a simple, smooth training scheme. (2) The proposed method can detect and recognize text of various shapes, including horizontal, oriented, and curved text. (3) In contrast to previous methods, precise text detection and recognition in our method are accomplished via semantic segmentation. (4) Our method achieves state-of-the-art performance in both text detection and text spotting on various benchmarks.

2 Related Work

2.1 Scene Text Detection

In scene text recognition systems, text detection plays an important role [59]. A large number of methods have been proposed to detect scene text [7, 36, 37, 50, 19, 23, 54, 21, 47, 54, 56, 30, 52, 55, 34, 15, 48, 43, 57, 16, 35, 31]. In [21], Jaderberg et al. use Edge Boxes [60] to generate proposals and refine candidate boxes by regression. Zhang et al. [54] detect scene text by exploiting the symmetry property of text. Adapted from Faster R-CNN [40] and SSD [33] with well-designed modifications, [56, 30] are proposed to detect horizontal words.

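Multi-oriented scene text detection has become a hot topic recently. Yao et al. [52] and Zhang et al. [55] detect multi-oriented scene text by semantic segmentation. Tian et al. [48] and Shi et al. [43] propose methods that first detect text segments and then link them into text instances by spatial relationship or link prediction. Zhou et al. [57] and He et al. [16] regress text boxes directly from dense segmentation maps. Lyu et al. [35] propose to detect and group the corner points of text to generate text boxes. Liao et al. [31] propose rotation-sensitive regression for oriented scene text detection.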

Multi-oriented scene text detection has become a hot topic recently. Yao et al. [52] and Zhang et al. [55] de