[Paper Translation] Scene Text Detection and Recognition: The Deep Learning Era (Part 1)

Abstract—With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area in computer vision, scene text detection and recognition has been inevitably influenced by this wave of revolution, consequentially entering the era of deep learning. In recent years, the community has witnessed substantial advancements in mindset, methodology and performance. This survey is aimed at summarizing and analyzing the major changes and significant progress of scene text detection and recognition in the deep learning era. Through this article, we endeavor to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; (3) look ahead into future trends. Specifically, we will emphasize the dramatic differences brought by deep learning and the grand challenges that still remain. We expect that this review paper would serve as a reference book for researchers in this field. Related resources are also collected and compiled in our Github repository: https://github.com/Jyouhou/SceneTextPapers.

Index Terms—Scene Text, Detection, Recognition, Deep Learning, Survey 

 

Fig. 1: Schematic diagram of scene text detection and recognition. The image sample is from Total-Text [15]. 

 

1 INTRODUCTION

Undoubtedly, text is among the most brilliant and influential creations of humankind. As the written form of human languages, text makes it feasible to reliably and effectively spread or acquire information across time and space. In this sense, text constitutes the cornerstone of human civilization.

On the one hand, text, as a vital tool for communication and collaboration, has been playing a more important role than ever in modern society; on the other hand, the rich, precise high-level semantics embodied in text could be beneficial for understanding the world around us. For example, text information can be used in a wide range of real-world applications, such as image search [116], [134], instant translation [23], [102], robot navigation [21], [79], [80], [117], and industrial automation [16], [39], [47]. Therefore, automatic text reading from natural environments (schematic diagram is depicted in Fig. 1), a.k.a. scene text detection and recognition [172] or PhotoOCR [8], has become an increasingly popular and important research topic in computer vision.

However, despite years of research, a series of grand challenges may still be encountered when detecting and recognizing text in the wild. The difficulties mainly stem from three aspects [172]:

 

• Diversity and Variability of Text in Natural Scenes: Distinctive from scripts in documents, text in natural scenes exhibits much higher diversity and variability. For example, instances of scene text can be in different languages, colors, fonts, sizes, orientations and shapes. Moreover, the aspect ratios and layouts of scene text may vary significantly. All these variations pose challenges for detection and recognition algorithms designed for text in natural scenes.

• Complexity and Interference of Backgrounds: Backgrounds of natural scenes are virtually unpredictable. There might be patterns extremely similar to text (e.g., tree leaves, traffic signs, bricks, windows, and stockades), or occlusions caused by foreign objects, which may potentially lead to confusion and mistakes.

 

• Imperfect Imaging Conditions: In uncontrolled circumstances, the quality of text images and videos cannot be guaranteed. That is, in poor imaging conditions, text instances may have low resolution and severe distortion due to inappropriate shooting distance or angle, be blurred because of defocus or shaking, be noised on account of low light level, or be corrupted by highlights or shadows.

These difficulties ran through the years before deep learning showed its potential in computer vision as well as in other fields. As deep learning came to prominence after AlexNet [68] won the ILSVRC2012 [115] contest, researchers turned to deep neural networks for automatic feature learning and started more in-depth studies. The community is now working on ever more challenging targets. The progress made in recent years can be summarized as follows:

 

• Incorporation of Deep Learning: Nearly all recent methods are built upon deep learning models. Most importantly, deep learning frees researchers from the exhausting work of repeatedly designing and testing hand-crafted features, which gives rise to a blossom of works that push the envelope further. To be specific, the use of deep learning substantially simplifies the overall pipeline. Besides, these algorithms provide significant improvements over previous ones on standard benchmarks. Gradient-based training routines also facilitate end-to-end trainable methods.

 

• Target-Oriented Algorithms and Datasets: Researchers are now turning to more specific aspects and targets. To address difficulties in real-world scenarios, newly published datasets are collected with unique and representative characteristics. For example, there are datasets that feature long text, blurred text, and curved text respectively. Driven by these datasets, almost all algorithms published in recent years are designed to tackle specific challenges. For instance, some are proposed to detect oriented text, while others aim at blurred and unfocused scene images. These ideas are also combined to make more general-purpose methods.

• Advances in Auxiliary Technologies: Apart from new datasets and models devoted to the main task, auxiliary technologies that do not solve the task directly also find their places in this field, such as synthetic data and bootstrapping.

In this survey, we present an overview of recent development in scene text detection and recognition with a focus on the deep learning era. We review methods from different perspectives, and list the up-to-date datasets. We also analyze the status quo and predict future research trends.

 

There have already been several excellent review papers [136], [154], [160], [172], which also comb and analyze works related to text detection and recognition. However, these papers were published before deep learning came to prominence in this field. Therefore, they mainly focus on more traditional and feature-based methods. We refer readers to these papers as well for a more comprehensive view and knowledge of the history. This article will mainly concentrate on text information extraction from still images, rather than videos. For scene text detection and recognition in videos, please also refer to [60], [160].

The remaining parts of this paper are arranged as follows. In Section 2, we briefly review the methods before the deep learning era. In Section 3, we list and summarize algorithms based on deep learning in a hierarchical order. In Section 4, we take a look at the datasets and evaluation protocols. Finally, we present potential applications and our own opinions on the current status and future trends.

 

2 METHODS BEFORE THE DEEP LEARNING ERA

2.1 Overview

In this section, we take a brief glance retrospectively at algorithms before the deep learning era. More detailed and comprehensive coverage of these works can be found in [136], [154], [160], [172]. For text detection and recognition, the focus has been on the design of features.

In this period of time, most text detection methods either adopt Connected Components Analysis (CCA) [24], [52], [58], [98], [135], [156], [159] or Sliding Window (SW) based classification [17], [70], [142], [144]. CCA based methods first extract candidate components through a variety of ways (e.g., color clustering or extreme region extraction), and then filter out non-text components using manually designed rules or classifiers automatically trained on hand-crafted features (see Fig. 2). In sliding window classification methods, windows of varying sizes slide over the input image, where each window is classified as a text segment/region or not. Those classified as positive are further grouped into text regions with morphological operations [70], Conditional Random Fields (CRF) [142] and other alternative graph-based methods [17], [144].
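To make the SW paradigm concrete, below is a minimal Python sketch of sliding-window text detection. The classifier `classify_window` is a hypothetical placeholder standing in for a text/non-text classifier trained on hand-crafted features (e.g., HOG); the window sizes and stride are illustrative, not taken from any cited method.

```python
import numpy as np

def sliding_window_detect(image, classify_window,
                          win_sizes=((32, 32), (32, 64)), stride=8):
    """Slide windows of several sizes over the image and keep the
    ones classified as text. `classify_window` is a placeholder for a
    classifier trained on hand-crafted features (e.g., HOG)."""
    h, w = image.shape[:2]
    positives = []
    for win_h, win_w in win_sizes:
        for y in range(0, h - win_h + 1, stride):
            for x in range(0, w - win_w + 1, stride):
                patch = image[y:y + win_h, x:x + win_w]
                if classify_window(patch):  # text / non-text decision
                    positives.append((x, y, win_w, win_h))
    # Positive windows are then grouped into text regions, e.g. with
    # morphological operations or a CRF, as described above.
    return positives
```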

 

Fig. 2: Illustration of traditional methods with hand-crafted features: (1) Maximally Stable Extremal Regions (MSER) [98], assuming chromatic consistency within each character; (2) Stroke Width Transform (SWT) [24], assuming consistent stroke width within each character.
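As a concrete illustration of the MSER branch in Fig. 2, the following is a minimal sketch using OpenCV's built-in MSER implementation; the post-filtering of non-text components is only hinted at in comments, and the input path is hypothetical.

```python
import cv2

# Minimal sketch: extract MSER component candidates from a scene image.
image = cv2.imread("scene.jpg")  # hypothetical input path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)  # candidate components

# Non-text components would then be filtered out by hand-crafted rules
# (aspect ratio, stroke width, ...) or a trained classifier, and the
# survivors grouped into text lines.
for x, y, w, h in bboxes:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 1)
```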

 

For text recognition, one branch adopted feature-based methods. Shi et al. [126] and Yao et al. [153] proposed character-segment based recognition algorithms. Rodriguez et al. [109], [110], Gordo et al. [35] and Almazan et al. [3] utilized label embedding to directly perform matching between strings and images. Strokes [10] and character key-points [104] are also detected as features for classification. Another branch decomposed the recognition process into a series of sub-problems. Various methods have been proposed to tackle these sub-problems, including text binarization [71], [93], [139], [167], text line segmentation [155], character segmentation [101], [114], [127], single character recognition [12], [120] and word correction [62], [94], [138], [145], [165].

There have been efforts devoted to integrated (i.e. end-to-end as we call it today) systems as well [97], [142]. In Wang et al. [142], characters are considered as a special case in object detection and detected by a nearest neighbor classifier trained on HOG features [19], and then grouped into words through a Pictorial Structure (PS) based model [26]. Neumann and Matas [97] proposed a decision delay approach by keeping multiple segmentations of each character until the last stage, when the context of each character is known. They detected character segmentations using extremal regions and decoded recognition results through a dynamic programming algorithm.

In summary, text detection and recognition methods before the deep learning era mainly extract low-level or mid-level hand-crafted image features, which entails demanding and repetitive pre-processing and post-processing steps. Constrained by the limited representation ability of hand-crafted features and the complexity of pipelines, those methods can hardly handle intricate circumstances, e.g. blurred images in the ICDAR2015 dataset [63].

 

3 METHODOLOGY IN THE DEEP LEARNING ERA 

Fig. 3: Overview of recent progress and dominant trends. 

 

As implied by the title of this section, we would like to address recent advances as changes in methodology instead of merely new methods. Our conclusion is grounded in the observations as explained in the following paragraph.

Methods in recent years are characterized by the following two distinctions: (1) most methods utilize deep-learning based models; (2) most researchers are approaching the problem from a diversity of perspectives. Methods driven by deep learning enjoy the advantage that automatic feature learning can save us from designing and testing the large number of potential hand-crafted features. At the same time, researchers from different viewpoints are enriching and promoting the community into more in-depth work, aiming at different targets, e.g. faster and simpler pipelines [168], text of varying aspect ratios [121], and synthetic data [38]. As we can also see further in this section, the incorporation of deep learning has totally changed the way researchers approach the task, and has enlarged the scope of research by far. This is the most significant change compared to the former epoch.

In a nutshell, recent years have witnessed a blossoming expansion of research into subdividable trends. We summarize these changes and trends in Fig. 3, and we would follow this diagram in our survey.

In this section, we classify existing methods into a hierarchical taxonomy, and introduce them in a top-down style. First, we divide them into four kinds of systems: (1) text detection that detects and localizes the existence of text in natural images; (2) recognition systems that transcribe and convert the content of the detected text region into linguistic symbols; (3) end-to-end systems that perform both text detection and recognition in one single pipeline; (4) auxiliary methods that aim to support the main task of text detection and recognition, e.g. synthetic data generation and image deblurring. Under each system, we review recent methods from different perspectives.

 

3.1 Detection 

There are three main trends in the field of text detection, and we would introduce them in the following sub-sections one by one. They are: (1) pipeline simplification; (2) changes in prediction units; (3) specified targets.

 

Fig. 4: Typical pipelines of scene text detection and recognition. (a) [55] and (b) [152] are representative multi-step methods. (c) and (d) are simplified pipelines. (c) [168] only contains a detection branch, and is therefore used together with a separate recognition model. (d) [45], [81] jointly train a detection model and a recognition model.

 

One of the important trends is the simplification of the pipeline, as shown in Fig. 4. Most methods before the era of deep learning, and some early methods that use deep learning, have multi-step pipelines. More recent methods have simplified and much shorter pipelines, which is key to reducing error propagation and simplifying the training process. More recently, separately trained two-stage methods have been surpassed by jointly trained ones. The main components of these methods are end-to-end differentiable modules, which is an outstanding characteristic.

 

Multi-step methods: Early deep-learning based methods [152], [166], [41] cast the task of text detection into a multi-step process. In [152], a convolutional neural network is used to predict, for each pixel in the input image, (1) whether it belongs to a character, (2) whether it is inside the text region, and (3) the text orientation around the pixel. Connected positive responses are considered as a detection of a character or text region. For characters belonging to the same text region, Delaunay triangulation [61] is applied, after which graph partition, based on the predicted orientation attribute, groups characters into text lines.

 

Similarly, [166] first predicts a dense map indicating which pixels are within text line regions. For each text line region, MSER [99] is applied to extract character candidates. Character candidates reveal information on the scale and orientation of the underlying text line. Finally, the minimum bounding box is extracted as the final text line candidate.

In [41], the detection process also consists of several steps. First, text blocks are extracted. Then the model crops and only focuses on the extracted text blocks to extract the text center line (TCL), which is defined to be a shrunk version of the original text line. Each text line represents the existence of one text instance. The extracted TCL map is then split into several TCLs. Each split TCL is then concatenated to the original image. A semantic segmentation model then classifies each pixel into those that belong to the same text instance as the given TCL, and those that do not.

 

Simplified pipeline: More recent methods [44], [59], [73], [82], [121], [163], [90], [111], [74], [119] follow a 2-step pipeline, consisting of an end-to-end trainable neural network model and a post-processing step that is usually much simpler than previous ones. These methods mainly draw inspiration from techniques in general object detection [27], [30], [31], [42], [76], [107], and benefit from highly integrated neural network modules that can predict text instances directly. There are mainly two branches: (1) anchor-based methods [44], [73], [82], [121] that predict the existence of text and regress the location offset only at pre-defined grid points of the input image; (2) region proposal methods [59], [74], [90], [111], [119], [163] that predict and regress on the basis of extracted image regions.

Since the original targets of most of these works are not merely the simplification of pipeline, we only introduce some representative methods here. Other works will be introduced in the following parts.

 

Fig. 5: High-level illustration of existing anchor/RoI-pooling based methods: (a) Similar to YOLO [105], predicting at each anchor position. Representative methods include rotating default boxes [82]. (b) Variants of SSD [76], including TextBoxes [73], predicting at feature maps of different sizes. (c) Direct regression of bounding boxes [168], also predicting at each anchor position. (d) Region proposal based methods, including rotating Regions of Interest (RoI) [90] and RoI of varying aspect ratios [59].

 

Anchor-based methods draw inspiration from SSD [76], a general object detection network. As shown in Fig. 5 (b), a representative work, TextBoxes [73], adapts the SSD network specially to fit the varying orientations and aspect ratios of text lines. Specifically, at each anchor point, default boxes are replaced by default quadrilaterals, which can capture the text line more tightly and reduce noise.
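The following is a minimal sketch of generating text-tailored default boxes on one feature map, in the spirit of TextBoxes-style SSD adaptations; a regression head would then predict per-box offsets (rectangular or quadrilateral). The aspect ratios, the vertically offset copies, and all numeric values are illustrative assumptions, not the exact published configuration.

```python
import numpy as np

def text_default_boxes(fmap_h, fmap_w, stride, scale,
                       aspect_ratios=(1, 2, 3, 5, 7, 10)):
    """Generate wide default boxes (cx, cy, w, h) on one feature map.
    Wide aspect ratios suit text lines; values are illustrative."""
    boxes = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for ar in aspect_ratios:  # text is wide: ar = w / h >= 1
                w, h = scale * np.sqrt(ar), scale / np.sqrt(ar)
                boxes.append((cx, cy, w, h))
                # a vertically offset copy densifies coverage of thin lines
                boxes.append((cx, cy + 0.5 * stride, w, h))
    return np.array(boxes, dtype=np.float32)
```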

A variant of the standard anchor-based default box prediction method is EAST [168]. In the standard SSD network, there are several feature maps of different sizes, on which default boxes of different receptive fields are detected. In EAST, all feature maps are integrated together by gradual upsampling, or a U-Net [113] structure to be specific. The size of the final feature map is 1/4 of the original input image, with c channels. Under the assumption that each pixel only belongs to one text line, each pixel on the final feature map, i.e. the 1 × 1 × c feature tensor, is used to regress the rectangular or quadrilateral bounding box of the underlying text line. Specifically, the existence of text, i.e. text/non-text, and geometries, e.g. orientation and size for rectangles, and vertex coordinates for quadrilaterals, are predicted. EAST makes a difference to the field of text detection with its highly simplified pipeline and its efficiency. Since EAST is most famous for its speed, we would re-introduce EAST in later parts, with emphasis on its efficiency.
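To illustrate the direct-regression idea, here is a minimal sketch of decoding EAST-style dense predictions into boxes. For brevity it assumes the axis-aligned variant, where each positive pixel regresses its distances to the four box edges; the rotation channel and the locality-aware NMS of the actual method are omitted, and the threshold is an assumption.

```python
import numpy as np

def decode_dense_boxes(score, geo, score_thresh=0.8, stride=4):
    """score: (H/4, W/4) text/non-text probabilities.
    geo:   (H/4, W/4, 4) distances to the top/right/bottom/left edges,
           in input-image pixels. Returns (x1, y1, x2, y2, score) rows."""
    boxes = []
    for y, x in zip(*np.where(score > score_thresh)):
        px, py = x * stride, y * stride  # position in the input image
        top, right, bottom, left = geo[y, x]
        boxes.append((px - left, py - top, px + right, py + bottom,
                      score[y, x]))
    # Boxes decoded from pixels of the same instance would then be merged
    # by (locality-aware) non-maximum suppression.
    return np.array(boxes, dtype=np.float32)
```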

 

Region proposal methods usually follow the standard object detection framework of R-CNN [30], [31], [107], where a simple and fast pre-processing method is applied to extract a set of region proposals that could contain text lines. A neural network then classifies each proposal as text/non-text and corrects the localization by regressing the boundary offsets. However, adaptations are necessary.

Rotation Region Proposal Networks [90] follow and adapt the standard Faster R-CNN framework. To fit text of arbitrary orientations, rotating region proposals are generated instead of the standard axis-aligned rectangles.

Similarly, R2CNN [59] modifies the standard region proposal based object detection method. To adapt to the varying aspect ratios, three Region-of-Interest poolings of different sizes are used and concatenated for further prediction and regression. In FEN [119], adaptively weighted poolings are applied to integrate different pooling sizes. The final prediction is made by leveraging the textness scores for poolings of 4 different sizes.
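A minimal sketch of the multi-size pooling idea, using torchvision's `roi_pool`: the same proposals are pooled at 7×7, 3×11 and 11×3, flattened, and concatenated before the classification/regression heads. The tensor shapes and spatial scale are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 256, 64, 64)                   # backbone feature map
rois = torch.tensor([[0, 16.0, 32.0, 400.0, 80.0]])  # (batch_idx, x1, y1, x2, y2)

pooled = [
    roi_pool(feat, rois, output_size=size, spatial_scale=1.0 / 16).flatten(1)
    for size in [(7, 7), (3, 11), (11, 3)]
]
fused = torch.cat(pooled, dim=1)  # input to the text/non-text and box heads
```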

 

3.1.2不同的预测单元

文本检测与一般对象的主要区别检测是,文本整体上是同质的但显示具有局部性,而一般物体检测不是。 通过同质性和局部性,我们倾向于认为任何文本实例的一部分仍然是文本。人们不需看整个文本实例知道它属于某些文本。这样的特性奠定了新分支的基础,仅预测子文本组件的文本检测方法然后将它们组装成文本实例。在这一部分中,我们采用基于粒度的视角审视文本检测。 有两个主要的预测粒度层次,文本实例级别和子文本级别。

 

3.1.2 Different Prediction Units

A main distinction between text detection and general object detection is that text is homogeneous as a whole and shows locality, while general objects are not. By homogeneity and locality, we refer to the property that any part of a text instance is still text. Humans do not have to see the whole text instance to know it belongs to some text. Such a property lays a cornerstone for a new branch of text detection methods that only predict sub-text components and then assemble them into a text instance. In this part, we take the perspective of the granularity of text detection. There are two main levels of prediction granularity: text instance level and sub-text level.

 

In text instance level methods [18], [46], [59], [73], [74], [82], [90], [119], [163], [168], the detection of text follows the standard routine of general object detection, where a region proposal network and a refinement network are combined to make predictions. The region proposal network produces an initial and coarse guess for the localization of possible text instances, and then a refinement part discriminates the proposals as text/non-text and also corrects the localization of the text.

Contrarily, sub-text level detection methods [89], [20], [41], [148], [152], [44], [40], [121], [166], [133], [140], [171] only predict parts that are combined to make a text instance. Such sub-text mainly includes pixel-level and component-level predictions.

 

In pixel-level methods [20], [41], [44], [148], [152], [166], an end-to-end fully convolutional neural network learns to generate a dense prediction map indicating whether each pixel in the original image belongs to any text instance or not. Post-processing methods then group pixels together depending on which pixels belong to the same text instance. Since text can appear in clusters, which makes predicted pixels connected to each other, the core of pixel-level methods is to separate text instances from each other. PixelLink [20] learns to predict whether two adjacent pixels belong to the same text instance by adding link prediction to each pixel. The border learning method [148] casts each pixel into three categories: text, border, and background, assuming that borders can well separate text instances. In Holistic [152], pixel-prediction maps include both the text-block level and the character-center level. Since the centers of characters do not overlap, the separation is done easily.
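The grouping step of PixelLink-style methods can be sketched as a union-find pass over the predicted text and link maps, merging adjacent positive pixels whose connecting link is also positive. This is a simplified illustration; the threshold and 8-neighbour layout are assumptions, and the real method symmetrizes links predicted from both directions.

```python
import numpy as np

def group_pixels(text_mask, link_probs, link_thresh=0.5):
    """text_mask:  (H, W) boolean text/non-text prediction.
    link_probs: (H, W, 8) link scores towards the 8 neighbours.
    Returns an (H, W) int32 map of instance labels (0 = background)."""
    parent = {p: p for p in zip(*np.nonzero(text_mask))}

    def find(p):  # union-find with path halving
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]
    for (y, x) in list(parent):
        for k, (dy, dx) in enumerate(neighbours):
            q = (y + dy, x + dx)
            # merge when both pixels are text and the link is positive
            if q in parent and link_probs[y, x, k] > link_thresh:
                parent[find((y, x))] = find(q)

    labels = np.zeros(text_mask.shape, dtype=np.int32)
    roots = {}
    for p in parent:
        labels[p] = roots.setdefault(find(p), len(roots) + 1)
    return labels
```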

In this part we only intend to introduce the concept of prediction units. We would go back to details regarding the separation of text instances in the section of Specific Targets.

 

Fig. 6: Illustration of representative bottom-up methods: (a) SegLink [121]: with SSD as the base network, predict word segments at each anchor position, and connections between adjacent anchors. (b) PixelLink [20]: predict, for each pixel, text/non-text classification and whether it belongs to the same text as adjacent pixels or not. (c) Corner Localization [89]: predict the four corners of each text and group those belonging to the same text instances. (d) TextSnake [85]: predict text/non-text and local geometries, which are used to reconstruct the text instance.

 

Component-level methods [40], [89], [121], [133], [140], [171] usually predict at a medium granularity. A component refers to a local region of a text instance, sometimes containing one or more characters.

As shown in Fig. 6 (a), SegLink [121] modifies the original framework of SSD [76]. Instead of default boxes that represent whole objects, default boxes used in SegLink have only one aspect ratio and predict whether the covered region belongs to a text instance or not. The region is called a text segment. Besides, links between default boxes are predicted, indicating whether the linked segments belong to the same text instance.

The corner localization method [89] proposes to detect the corners of each text instance. Since each text instance only has 4 corners, the prediction results and their relative positions can indicate which corners should be grouped into the same text instance.

 

SegLink [121] and corner localization [89] are proposed specially for long and multi-oriented text. We only introduce the ideas here and discuss more details in the section on Specific Targets, regarding how they are realized. In a clustering based method [140], pixels are clustered according to their color consistency and edge information. The fused image segments are called superpixels. These superpixels are further used to extract characters and predict text instances.

Another branch of component-level methods is the Connectionist Text Proposal Network (CTPN) [133], [147], [171]. CTPN models inherit the ideas of anchoring and recurrent neural networks for sequence labeling. They usually consist of a CNN-based image classification network, e.g. VGG, and stack an RNN on top of it. Each position in the final feature map represents features in the region specified by the corresponding anchor. Assuming that text appears horizontally, each row of features is fed into an RNN and labeled as text/non-text. Geometries are also predicted.
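A minimal PyTorch sketch of this design: each row of the CNN feature map is treated as a horizontal sequence and passed through a bidirectional LSTM, after which every position is classified as text/non-text and its vertical geometry is regressed. The channel counts and the number of anchors per position are illustrative, not the exact CTPN configuration.

```python
import torch
import torch.nn as nn

class RowRNNHead(nn.Module):
    """Sketch of the CTPN idea: label every anchor position of each
    feature-map row with a bidirectional LSTM."""

    def __init__(self, in_ch=512, hidden=128, num_anchors=10):
        super().__init__()
        self.rnn = nn.LSTM(in_ch, hidden, bidirectional=True, batch_first=True)
        self.cls = nn.Linear(2 * hidden, 2 * num_anchors)  # text / non-text
        self.reg = nn.Linear(2 * hidden, 2 * num_anchors)  # vertical geometry

    def forward(self, fmap):  # fmap: (N, C, H, W), e.g. a VGG conv feature map
        n, c, h, w = fmap.shape
        seq = fmap.permute(0, 2, 3, 1).reshape(n * h, w, c)  # one sequence per row
        out, _ = self.rnn(seq)                               # (N*H, W, 2*hidden)
        scores = self.cls(out).reshape(n, h, w, -1)
        geo = self.reg(out).reshape(n, h, w, -1)
        return scores, geo

# usage sketch
head = RowRNNHead()
scores, geo = head(torch.randn(2, 512, 16, 20))
```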

 

3.1.3 Specific Targets

Another characteristic of current text detection systems is that most of them are designed for special purposes, attempting to approach unique difficulties in detecting scene text. We broadly classify them into the following aspects.

 

3.1.3.1 Long Text: Unlike general objects, text usually comes in varying aspect ratios. Text lines can have much more extreme width-height ratios, on which general object detection frameworks would fail. Several methods have been proposed [59], [89], [121] that are specially designed to detect long text.

 

R2CNN [59] gives an intuitive solution, where RoI poolings of different sizes are used. Following the framework of Faster R-CNN [107], three RoI poolings with varying pooling sizes, 7×7, 3×11, and 11×3, are performed for each box generated by the region proposal network, and the pooled features are concatenated for the textness score.

 

Another branch learns to detect local sub-text components which are independent of the whole text [20], [89], [121]. SegLink [121] proposes to detect components, i.e. square areas that are text, and how these components are linked to each other. PixelLink [20] predicts which pixels belong to text and whether adjacent pixels belong to the same text instance. Corner localization [89] detects text corners. All these methods learn to detect local components and then group them together to make final detections.

 

3.1.3.2 Multi-Oriented Text: Another aspect in which text detection differs from general object detection is that text detection is rotation-sensitive, while skewed text is common in the real world. Conventional axis-aligned prediction boxes would incorporate cluttered background, which hurts both detection performance and the subsequent text recognition module. Several solutions have been proposed [59], [73], [74], [82], [90], [121], [168], [141].
Extending from general anchor-based methods, rotating default boxes [73], [82] are used, with predicted rotation offsets. Similarly, rotating region proposals [90] are generated in 6 different orientations. Regression-based methods [59], [121], [168] predict the rotation and the positions of vertexes, which are insensitive to orientation. Further, in Liao et al. [74], rotating filters [169] are incorporated to model orientation-invariance explicitly. The peripheral weights of 3×3 filters rotate around the center weight, to capture features that are sensitive to rotation.
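A minimal sketch of generating rotated default boxes/proposals at six orientations, in the spirit of the rotation-based methods above; the scales, ratios, and angle values are illustrative assumptions rather than any published configuration.

```python
import numpy as np

def rotated_anchors(fmap_h, fmap_w, stride, scales=(16, 32), ratios=(2, 5),
                    angles_deg=(-60, -30, 0, 30, 60, 90)):
    """Generate rotated anchors (cx, cy, w, h, theta) at every grid point,
    with wide boxes for text lines and six orientations."""
    anchors = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in scales:
                for r in ratios:  # r = w / h >= 1 for text lines
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    for a in angles_deg:
                        anchors.append((cx, cy, w, h, np.deg2rad(a)))
    return np.array(anchors, dtype=np.float32)
```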

 

While the aforementioned methods may entail additional post-processing, Wang et al. [141] propose to use a parametrized Instance Transformation Network (ITN) that learns to predict an appropriate affine transformation to perform on the last feature layer extracted by the base network, in order to rectify oriented text instances. Their method, with ITN, can be trained end-to-end.

 

 

3.1.3.3 Text of Irregular Shapes: Apart from varying aspect ratios, another distinction is that text can have a diversity of shapes, e.g. curved text. Curved text poses a new challenge, since a regular rectangular bounding box would incorporate a large proportion of background and even other text instances, making it difficult for recognition.

 

Extending from the quadrilateral bounding box, it is natural to use bounding 'boxes' with more than 4 vertexes. Bounding polygons [163] with as many as 14 vertexes are proposed, followed by a Bi-LSTM [48] layer to refine the coordinates of the predicted vertexes. In their framework, however, axis-aligned rectangles are extracted as intermediate results in the first step, and the bounding polygons are predicted upon them.

 

Similarly, Lyu et al. [88] modify the Mask R-CNN [42] framework, so that for each region of interest (in the form of an axis-aligned rectangle), character masks are predicted separately for each character class. These predicted characters are then aligned together to form a polygon as the detection result. Notably, they propose their method as an end-to-end system. We would refer to it again in the following part.

 

Fig. 7: (a)-(c): Representing text as horizontal rectangles, oriented rectangles, and quadrilaterals. (d): The sliding-disk representation proposed in TextSnake [85].

 

Viewing the problem from a different perspective, Long et al. [85] argue that text can be represented as a series of sliding round disks along the text center line (TCL), which accords with the running direction of the text instance, as shown in Fig. 7. With this novel representation, they present a new model, TextSnake, as shown in Fig. 6 (d), which learns to predict local attributes, including TCL/non-TCL, text-region/non-text-region, radius, and orientation. The intersection of TCL pixels and text region pixels gives the final prediction of the pixel-level TCL. Local geometries are then used to extract the TCL in the form of an ordered point list, as demonstrated in Fig. 6 (d). With the TCL and radii, the text line is reconstructed. It achieves state-of-the-art performance on several curved text datasets as well as more widely used ones, e.g. ICDAR2015 [63] and MSRA-TD500 [135]. Notably, Long et al. propose a cross-dataset validation, where models are only fine-tuned on datasets with straight text instances and tested on the curved datasets. On all existing curved datasets, TextSnake achieves improvements of up to 20% in F1-score over other baselines.

 

3.1.3.4 Speedup: Current text detection methods place more emphasis on speed and efficiency, which is necessary for application in mobile devices.

 

The first work to gain significant speedup is EAST [168], which makes several modifications to the previous framework. Instead of VGG [129], EAST uses PVANet [67] as its base network, which strikes a good balance between efficiency and accuracy in the ImageNet competition. Besides, it simplifies the whole pipeline into a prediction network and a non-maximum suppression step. The prediction network is a U-shaped [113] fully convolutional network that maps an input image I ∈ R^{H×W×C} to a feature map F ∈ R^{H/4×W/4×K}, where each position f = F_{i,j,:} ∈ R^{1×1×K} is the feature vector that describes the predicted text instance: the locations of the vertexes or edges, the orientation, and the offsets of the center, for the text instance corresponding to that feature position (i, j). Feature vectors that correspond to the same text instance are merged with non-maximum suppression. It achieves state-of-the-art speed with an FPS of 16.8, as well as leading performance on most datasets.

 

3.1.3.5 Easy Instance Segmentation: As mentioned above, recent years have witnessed methods with dense predictions, i.e. pixel-level predictions [20], [41], [103], [148]. These methods generate a prediction map classifying each pixel as text or non-text. However, as text may come near each other, pixels of different text instances may be adjacent in the prediction map. Therefore, separating pixels becomes important.

 

Pixel-level text center lines are proposed in [41], since the center lines are far from each other. In [41], a prediction map indicating text lines is generated. These text lines can be easily separated as they are not adjacent. To produce a prediction for a text instance, a binary map of the text center line of that instance is attached to the original input image and fed into a classification network. A saliency mask is generated indicating the detected text. However, this method involves several steps: the text-line generation step and the final prediction step cannot be trained end-to-end, and error propagates.

 

Another way to separate different text instances is to use the concept of border learning [103], [148], [149], where each pixel is classified into one of three classes: text, non-text, and text border. The text border then separates text pixels that belong to different instances. Similarly, in the work of Xue et al. [149], text is considered to be enclosed by 4 segments, i.e. a pair of long-side borders (abdomen and back) and a pair of short-side borders (head and tail). The method of Xue et al. is also the first to use DenseNet [51] as its base network, which provides a consistent 2-4% performance boost in F1-score over ResNet [43] on all datasets it is evaluated on.
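A minimal sketch of the border-learning separation: dropping the predicted border pixels disconnects neighbouring instances, and a connected-component pass then labels each instance. The three-class integer encoding is an assumption for illustration.

```python
import numpy as np
import cv2

def separate_instances(class_map):
    """class_map: (H, W) per-pixel labels, assumed encoding:
    0 = background, 1 = text, 2 = text border.
    Returns (num_instances, label_map)."""
    text_only = (class_map == 1).astype(np.uint8)  # drop border pixels
    num, labels = cv2.connectedComponents(text_only)
    # Border pixels could afterwards be assigned to the nearest component
    # to recover the full extent of each text instance.
    return num - 1, labels
```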

 

Following the linking idea of SegLink, PixelLink [20] learns to link pixels belonging to the same text instance. Text pixels are classified into groups for different instances efficiently via a disjoint-set algorithm. Treating the task in the same way, Liu et al. [84] propose a method for predicting the composition of adjacent pixels with Markov Clustering [137], instead of neural networks. The Markov Clustering algorithm is applied to the saliency map of the input image, which is generated by a neural network and indicates whether each pixel belongs to any text instance or not. The clustering results then give the segmented text instances.

 

3.1.3.6 Retrieving Designated Text: Different from the classical setting of scene text detection, sometimes we want to retrieve a certain text instance given a description. Rong et al. [112] propose a multi-encoder framework to retrieve text as designated. Specifically, text is retrieved as required by a natural language query. The multi-encoder framework includes a Dense Text Localization Network (DTLN) and a Context Reasoning Text Retrieval (CRTR) module. DTLN uses an LSTM to decode the features of an FCN network into a sequence of text instances. CRTR encodes the query and the features of the scene text image to rank the candidate text regions generated by DTLN. As far as we are concerned, this is the first work that retrieves text according to a query.

 

3.1.3.7 Against Complex Backgrounds: An attention mechanism is introduced to suppress complex backgrounds [44]. The stem network is similar to that of the standard SSD framework predicting word boxes, except that it applies inception blocks on its cascading feature maps, obtaining what is called the Aggregated Inception Feature (AIF). An additional text attention module is added, which is again based on inception blocks. The attention is applied to all AIFs, reducing the noisy background.
