An automatic table recognition method for interpreting tabular data in document images mainly involves solving two problems: table detection and table structure recognition. Prior work solved the two problems independently using separate approaches, while more recent works rely on deep learning-based solutions and also attempt to design end-to-end systems. In this paper, we present an improved deep learning-based end-to-end approach that solves both table detection and structure recognition with a single Convolutional Neural Network (CNN) model. We propose CascadeTabNet: a Cascade mask Region-based CNN High-Resolution Network (Cascade mask R-CNN HRNet) based model that detects table regions and recognizes the structural body cells of the detected tables at the same time. We evaluate our results on the ICDAR 2013, ICDAR 2019 and TableBank public datasets. We achieved 3rd rank in the ICDAR 2019 post-competition results for table detection while attaining the best accuracy results on the ICDAR 2013 and TableBank datasets. We also attain the highest accuracy results on the ICDAR 2019 table structure recognition dataset. Additionally, we demonstrate effective transfer learning and image augmentation techniques that enable CNNs to achieve very accurate table detection results. Code and dataset have been made available at: https://github.com/DevashishPrasad/CascadeTabNet.
1. Introduction
The world is changing and going digital. The use of digitized documents instead of physical paper-based documents is growing rapidly. These documents contain a variety of table-based information with variations in appearance and layout. An automatic table information extraction method involves two subtasks: table detection and table structure recognition. In table detection, the region of the image that contains the table is identified, while table structure recognition involves identifying the rows and columns in order to locate individual table cells. Prior approaches solved these two sub-problems independently.
In this paper, we propose CascadeTabNet, an improved deep learning-based end-to-end approach for solving the two sub-problems using a single model. The problem of table detection is solved using instance segmentation. We perform table segmentation on each image, identifying each table instance within the image at the pixel level. Similarly, we perform table cell segmentation on each image to predict segmented cell regions within each table and thereby identify the structure of the table. Table and cell regions are predicted in a single inference pass (at the same time) by the model. Simultaneously, the model classifies tables into two types: bordered (ruling-based) and borderless (no ruling-based) tables. The model predicts cell segmentation only for borderless tables. We use simple rule-based conventional text detection and line detection algorithms to extract cells from bordered tables.
We demonstrate the effectiveness of iterative transfer learning in making the CNN learn from a small amount of training data as well as enabling it to perform well on multiple datasets after fine-tuning on the respective datasets. A new form of image augmentation was also incorporated into the training process to enhance table detection accuracy and help the model learn more effectively.
Evaluation of the table detection task was performed on three public datasets: ICDAR 2013, the ICDAR 2019 competition (Track A) dataset, and the TableBank dataset. We achieve 3rd rank in the post-competition results of ICDAR 2019 for table detection. We achieve the highest accuracy for the table detection task on the ICDAR 2013 dataset and on all three subsets of the TableBank dataset. For the table structure recognition task, we evaluate the model on the ICDAR 2019 dataset (Track B2) and achieve the highest rank in the post-competition results.
The main contributions of this paper are as follows:
- We propose CascadeTabNet: an end-to-end deep-learning-based approach that uses the Cascade Mask R-CNN HRNet model for both table detection and structure recognition.
- We show that the proposed image transformation techniques, used for image augmentation during training, enhance table detection accuracy significantly.
- We perform a comparative analysis of various CNN models for the table detection task, in which the Cascade Mask R-CNN HRNet model outperforms the other models.
- We demonstrate an effective iterative transfer learning-based methodology that helps the model perform well on different types of datasets using a small amount of training data.
- We manually annotated some of the ICDAR 19 dataset images for table cell detection in borderless tables, while also categorizing tables into two classes (bordered and borderless), and will release the annotations to the community.
2. Related Work
In 1997, P. Pyreddy and W. B. Croft [19] were the first to propose an approach for detecting tables using heuristics such as character alignment, holes and gaps. To improve accuracy, Wonkyo Seo et al. [22] used junction (intersection of horizontal and vertical lines) detection with some post-processing. T. Kasar et al. [15] also used junction detection, but instead of heuristics, they passed the junction information to an SVM.
With the ascent of deep learning and object detection, Azka Gilani et al. [9] were the first to propose a deep learning-based approach for table detection, using a Faster R-CNN based model. They also attempted to improve the accuracy of the model by introducing distance-based augmentation for table detection. Some approaches tried to utilize semantic information: S. Arif and F. Shafait [1] attempted to improve the accuracy of Faster R-CNN by using semantic color-coding of text, and Dafang He et al. [12] used an FCN for semantic page segmentation together with an end verification network to determine whether a segmented part is a table or not.
In 1998, Kieninger and Dengel [16] proposed an initial approach for table structure recognition by clubbing the text into chunks and dividing those chunks into cells based on the column borders. Tables contain many basic objects such as lines and characters. Waleed Farrukh et al. [7] used a bottom-up heuristic-based approach on these basic objects to construct the cells. Zewen Chi et al. [5] proposed a graph-based approach for table structure recognition in which they used the SciTSR dataset, constructed by themselves, for training the GraphTSR model.
Sebastian Schreiber et al. [21] were the first to perform table detection and structure recognition together with a two-fold system that uses Faster R-CNN for table detection and, subsequently, deep learning-based semantic segmentation for table structure recognition. To make the model generalize better, Mohammad Mohsin et al. [20] used a combination of a GAN-based architecture for table detection and a SegNet-based encoder-decoder architecture for table structure segmentation.
Recently, Shubham Paliwal et al. [18] were the first to propose a deep learning-based end-to-end approach that performs table detection and column detection using an encoder-decoder with VGG-19 as the base semantic segmentation method, where the encoder is shared and the decoder differs for the two tasks. After the table detection results are obtained from the model, the rows are extracted from the table region using a semantic rule-based method. This approach uses the Tesseract OCR engine for text localization.
3. CascadeTabNet: The presented approach
We focus on using a small amount of data effectively to achieve high accuracy results. Working towards this goal, our primary strategy includes:
- Using a relatively complex but efficient CNN architecture, one that attains high accuracy on object detection and segmentation benchmarking datasets, as the main component of the approach.
- Using an iterative transfer learning approach to train the CNN model gradually, starting from more general tasks and moving towards more specific tasks. Iterations of transfer learning are performed multiple times to extract the required knowledge effectively from a small amount of data.
- Strengthening the learning process by applying image transformation techniques to the training images for data augmentation.
We elaborate on the strategies in the following subsections and explain the pipeline of the approach.
3.1. Cascade mask R-CNN HRNet
To attain very high accuracy results we use a model that combines two approaches. Cascade R-CNN was originally proposed by Cai and Vasconcelos [2] to solve the paradox of high-quality detection in CNNs by introducing a multi-stage model. A modified HRNet was proposed by Jingdong Wang et al. [25] to attain reliable high-resolution representations and multi-level representations for semantic segmentation as well as for object detection. Our experiments and analysis show that the cascaded multi-staged model with the HRNet backbone network yields the best results, owing to the ability of both approaches to strive for high-accuracy object segmentation.
The original architecture of HRNet [14] (HRNetV1) was enhanced for semantic segmentation to form HRNetV2 [25]. Then, a feature pyramid was formed over HRNetV2 for object detection to form HRNetV2p [25]. CascadeTabNet is a three-staged Cascade mask R-CNN HRNet model. A backbone, such as a ResNet-50 without the last fully connected layer, is the part of the model that transforms an image into feature maps. CascadeTabNet uses HRNetV2p W32 [25] (32 indicates the width of the high-resolution convolution) as the backbone of the model.
The architecture strategy of the Cascade mask R-CNN [3] is very similar to that of the Cascade R-CNN [2]. The Cascade R-CNN architecture is extended to the instance segmentation task by attaching a segmentation branch, as done in Mask R-CNN [13]. To explain the model architecture we use naming conventions similar to those of the MMDetection framework [4]. As shown in Figure 1, the image "I" is fed into the model. The backbone CNN HRNetV2p W32 transforms the image "I" into feature maps. The "RPN Head" (Dense Head) predicts preliminary object proposals from these feature maps. The "Bbox Heads" take RoI features as input and make RoI-wise predictions. Each head makes two predictions: bounding box classification scores and box regression points. "B" denotes the bounding boxes predicted by the heads and, for simplicity, we do not show the classification scores in the figure. The "Mask Head" predicts the masks for the objects, and "S" denotes a segmentation output. At inference, the object detections made by the "Bbox Heads" are complemented with segmentation masks made by the "Mask Head" for all detected objects.
For image segmentation using the Cascade R-CNN, Cai and Vasconcelos [3] propose multiple strategies in which the segmentation branch is placed at various stages of the network. CascadeTabNet utilizes the strategy of adding the segmentation branch at the last stage of the Cascade R-CNN. The model was implemented using the MMDetection toolbox [4]. We use the default implementation (cascade_mask_rcnn_hrnetv2p_w32_20e) of the model for our experiments and analysis.
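For concreteness, the following is a minimal sketch of how the default MMDetection implementation named above could be loaded for inference; the checkpoint path, device string and input image name are placeholders rather than artifacts shipped with the paper.

```python
# Minimal inference sketch using the MMDetection high-level API.
from mmdet.apis import init_detector, inference_detector

config_file = 'configs/hrnet/cascade_mask_rcnn_hrnetv2p_w32_20e.py'  # assumed config location
checkpoint_file = 'checkpoints/cascadetabnet.pth'                     # hypothetical trained weights

# Build the Cascade mask R-CNN HRNet model and load the weights.
model = init_detector(config_file, checkpoint_file, device='cuda:0')

# Returns per-class bounding boxes (with scores) and segmentation masks for
# every detected table / borderless-cell instance in the page image.
result = inference_detector(model, 'document_page.png')
```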
3.2. Iterative transfer learning
Both tasks involve object segmentation, and we use a multi-task learning approach as well as multiple iterations of transfer learning to achieve our goal. In short, we first train our model on a general dataset and then fine-tune it multiple times for specific datasets. More precisely, we use two iterations of transfer learning, so we call this approach two-stage transfer learning.
First, we create a general dataset for the general task of table detection. We add images of different types of documents, such as Word and LaTeX documents, to this dataset. These documents contain tables of various types: bordered, semi-bordered and borderless. A bordered table is one for which an algorithm can use just the line positions to estimate the cells and the overall structure of the table. If some of the lines are missing, it becomes difficult for a line detection based algorithm to separate the adjacent cells of the table; we call such a table, in which some lines are not present, a semi-bordered table. A borderless table is one which does not have any lines. Detecting only the tables in images is a general task for an algorithm, but detecting them according to their types is a specific task. For example, detecting dogs in images is a general task, but detecting only bulldogs and pugs is a more specific task that requires relatively more data for the model. To make it a general task for table recognition, initially all these tables in the images are annotated as one class (the table class), which enables the model to learn common and general features for detecting tables. The trained model can then use this knowledge to learn even more specific tasks, such as table detection according to table type.
The two-stage transfer learning strategy is used to make a single model learn end-to-end table recognition using a small amount of data. In this strategy, transfer learning is practiced two times on the same model. Detecting tables in images is a specific task for a CNN model that was earlier trained on a dataset with hundreds of thousands of images to detect objects from a thousand classes. So, in the first iteration of transfer learning, we initialize our CNN model with pre-trained ImageNet-COCO model weights before training. This enables the CNN model to learn only task-specific higher-level features while gaining some advantages, such as a reduced need for training data and a shorter total training time owing to the prior knowledge. After training, the CNN successfully predicts the table detection masks for tables in the images. Similarly, in the second iteration, the model is again fine-tuned on a smaller dataset to accomplish the even more specific task of predicting the cell masks in borderless tables along with detecting tables according to their types. Another challenging and specific task can be table detection for a particular type of document image (e.g., LaTeX documents). We do not freeze any of the layers of the model at any stage while performing iterative transfer learning.
For the task of table structure recognition, which involves predicting the cell masks in borderless tables along with detecting the different types of tables, we create a smaller dataset. It contains fewer images than the one used for table detection. This new dataset contains slightly more advanced annotations, instructing the model to detect tables of two types with their labels (two classes) as bordered and borderless, as well as to predict borderless table cell masks (three classes in total). We put borderless and semi-bordered tables in one class, the borderless class. We put semi-bordered tables in the borderless class because we cannot use line information alone to extract cells from them; we need cell predictions from the model for semi-bordered tables. After again fine-tuning the model on this dataset, it successfully detects tables with their types and also predicts segmentation masks for table bodies and for the cells of borderless tables with very high accuracy.
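As an illustration of this second transfer-learning iteration, the fragment below sketches how such a fine-tuning run could be expressed as an MMDetection-style config; the inheritance mechanism, file paths, class names and schedule values are illustrative assumptions rather than the exact configuration used in the paper.

```python
# Hedged MMDetection-style config fragment for the second iteration.
_base_ = './cascade_mask_rcnn_hrnetv2p_w32_20e.py'

# Three classes for structure recognition: bordered table, borderless table,
# and borderless-table cell. The num_classes of every cascade bbox head and of
# the mask head in the base config must be set to 3 accordingly.
classes = ('bordered_table', 'borderless_table', 'cell')

data = dict(
    samples_per_gpu=1,
    train=dict(classes=classes, ann_file='annotations/structure_train.json'),
    val=dict(classes=classes, ann_file='annotations/structure_val.json'),
    test=dict(classes=classes, ann_file='annotations/structure_test.json'))

# Initialise from the general table-detection weights of the first iteration;
# no layers are frozen, so all weights are updated during fine-tuning.
load_from = 'work_dirs/general_table_detection/latest.pth'

# A short schedule with a reduced learning rate is typical for fine-tuning.
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
```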
This strategy worked effectively because, while doing the knowledge transfer between the two tasks, the domains of both tasks were the same. If the domains of the two tasks are different, for example, training a model to detect dogs in images and then using the same model to detect different types of horses, the transfer may be negative. Figure 2 shows the figurative explanation of two-staged transfer learning, where the same model is trained iteratively from a general task to a more specific task, with the size of the dataset reducing as we move down.
3.3. Image Transformation and data augmentation
Providing a large amount of training data can easily produce deep-learning-based models that attain very high accuracy results. Adding more training data also prevents models from over-fitting to the training data. For this reason, we implement image augmentation techniques on the original training images to increase the size of the training data. However, not all such techniques are effective for augmenting document images. For example, shear and rotation transformations would not be an effective strategy because the digital documents in the datasets are perfectly axis-aligned. We try to implement techniques that help the model learn more accurately.
Documents have text or content regions and blank spaces. As the text elements in documents are very small and the proposed model was designed for detecting real-world objects in images, we try to make the contents better understandable to the object segmentation model by thickening the text regions and reducing the blank-space regions. We propose image transformation techniques that help the model learn more efficiently. The transformed images are added to the original dataset, which also increases the amount of relevant training data for the model.
We propose two types of image transformation techniques: the dilation transform and the smudge transform.
3.3.1 Dilation transform
In the dilation transform, we transform the original image to thicken the black pixel regions. We convert the original images into binary images before applying the dilation transform. In Figure 3, a) is the original image and b) is the transformed, dilated image. A 2x2 kernel filter with one iteration was applied to the binary image to generate the transformed image. Experiments showed that a kernel size of 2x2 gave better results.
3.3.2 Smudge transform
In the smudge transform, we transform the original image to spread the black pixel regions so that they look like smeary, blurred black pixel regions. The original images are converted into binary images before the smudge transform is applied. In Figure 3, a) is the original image and c) is the transformed, smudged image. The smudge transform is implemented using various distance transforms. The original algorithm is described by Gilani et al. [9]; it applies the Euclidean distance transform, the linear distance transform, and the max distance transform to the image. Additionally, some normalization and parameter tuning enhanced the results.
3.4. Pipeline
In this section, we describe the various stages in the pipeline of the CascadeTabNet end-to-end system for table recognition.
Figure 4 shows the block diagram of the pipeline. The two-stage fine-tuned CascadeTabNet model takes in the image of a document containing zero or more tables. It predicts segmentation masks for tables of two types, bordered and borderless, as discussed earlier. Next in the pipeline, there are separate branches for bordered and borderless tables. Depending on the type of the detected table, it is further processed by the post-processing module of the respective branch. The post-processing modules perform the trivial tasks of arranging and cleaning the outputs of the model.
In the borderless branch, we arrange the predicted cells detected inside the table into rows and columns based on their positions. We estimate the missing table lines using the positions of the identified rows and columns. Based on these lines, we detect the undetected cells using a contour-based text detection algorithm. Finally, row-span and column-span cells are also identified after estimating the lines.
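A minimal sketch of the row-grouping step in this branch, assuming predicted cells are given as (x1, y1, x2, y2) boxes; the pixel tolerance is an illustrative value.

```python
def group_cells_into_rows(cell_boxes, row_tolerance=10):
    """Sort predicted cell boxes by vertical centre and group boxes whose
    centres lie within `row_tolerance` pixels into the same row; boxes within
    a row are ordered left to right."""
    boxes = sorted(cell_boxes, key=lambda b: (b[1] + b[3]) / 2.0)
    rows, current_row = [], []
    for box in boxes:
        centre_y = (box[1] + box[3]) / 2.0
        if not current_row:
            current_row = [box]
        elif abs(centre_y - (current_row[0][1] + current_row[0][3]) / 2.0) <= row_tolerance:
            current_row.append(box)
        else:
            rows.append(sorted(current_row, key=lambda b: b[0]))
            current_row = [box]
    if current_row:
        rows.append(sorted(current_row, key=lambda b: b[0]))
    return rows
```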
In the bordered branch, a conventional line detection algorithm is used to detect the lines of bordered tables. The cells are identified using the line intersection points, and within each cell the text regions are detected using the contour-based text detection algorithm. We prefer not to train our model to predict cell segmentation masks for bordered tables because using the line information from bordered tables is much easier and more efficient for recognizing the cells.
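A hedged sketch of the line-detection step of the bordered branch, using morphological extraction of horizontal and vertical rulings as one conventional realization; the kernel lengths are assumptions.

```python
import cv2
import numpy as np

def detect_grid_intersections(table_image: np.ndarray) -> np.ndarray:
    """Extract horizontal and vertical ruling lines with morphological opening
    and intersect them to find candidate cell corner points."""
    gray = cv2.cvtColor(table_image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))  # long horizontal kernel
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))  # long vertical kernel
    horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
    vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)
    # Pixels belonging to both a horizontal and a vertical line are intersection
    # points; cells are bounded by neighbouring intersections.
    return cv2.bitwise_and(horizontal, vertical)
```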
4. Dataset Preparation
For creating a general dataset for the table detection task, we merge three datasets: ICDAR 19 (cTDaR) [8], Marmot [6] and a GitHub dataset [23].
The cTDaR competition aims at benchmarking state-of-the-art table detection (Track A) and contains two subsets, Modern and Archival, further described in [8]. We include only the Modern subset of this dataset in the general dataset. This subset contains images of Word and LaTeX documents, with text in the English and Chinese languages. We also include the publicly available Marmot dataset published by the Institute of Computer Science and Technology of Peking University, further described in [6]. The Marmot dataset holds two subsets, Chinese and English; we include both in the general dataset. As done for DeepDeSRT [21], to achieve the best possible results, we removed the errors in the ground-truth annotations of the dataset. Finally, we also include a dataset from the internet [23] in the general dataset, which contains only borderless table images from magazine and newspaper based document images. This dataset was also cleaned, like the Marmot dataset. The general dataset contains a total of 1934 images with 2835 tables, and we use this dataset to train a general model.
For the preliminary analysis of image augmentation, we created four training sets. The first set contains the original images. The second set is created by applying the dilation transform to all the images of the original set and adding them to the set along with the corresponding original images. Similarly, the third set is created by applying the smudge transform to the original images. The last set is created by adding the smudged, dilated and original images altogether. In Section 5 we perform a rigorous analysis of these training sets by training different types of models and show the effectiveness of the augmentation techniques, as they boost the models' performance.
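As a sketch, building the fourth (combined) training set could look like the snippet below, assuming the dilation_transform and smudge_transform sketches from Section 3.3 are saved in a hypothetical transforms.py; because both transforms preserve geometry, the original ground-truth boxes can be reused for the augmented copies.

```python
from pathlib import Path
import cv2

# Hypothetical module containing the two transform sketches from Section 3.3.
from transforms import dilation_transform, smudge_transform

def build_combined_set(src_dir: str, dst_dir: str) -> None:
    """Copy every original page and add its dilated and smudged versions.
    Annotations are unchanged, since neither transform moves any pixels."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for img_path in Path(src_dir).glob('*.png'):
        cv2.imwrite(f'{dst_dir}/{img_path.stem}_orig.png', cv2.imread(str(img_path)))
        cv2.imwrite(f'{dst_dir}/{img_path.stem}_dilate.png', dilation_transform(str(img_path)))
        cv2.imwrite(f'{dst_dir}/{img_path.stem}_smudge.png', smudge_transform(str(img_path)))
```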
To evaluate the model on the ICDAR 19 (Track A Modern) competition dataset, we apply the dilation transform to all the images of the Track A Modern dataset and then fine-tune the general model on it.
For testing all of the aforementioned datasets, we use the test set of the ICDAR 19 dataset (Track A Modern). We find this set robust and ideal for testing because it contains all types of images, such as LaTeX and Word documents, with all types of tables.
We also provide evaluation results on the TableBank dataset [17]. TableBank is a new image-based table detection and recognition dataset that contains 417K table images based on Word and LaTeX documents. The table detection subset of the dataset has 163,417 images in the Word subset, 253,817 images in the LaTeX subset and 417,234 images in the Word+LaTeX subset. To demonstrate the effectiveness of our approach, we do not fine-tune the model on the whole dataset. Instead, we fine-tune the model on a very small subset of the actual TableBank dataset. For LaTeX, we choose only 1500 images randomly from TableBank LaTeX for training. For creating the LaTeX test set, we randomly choose 1000 images from the TableBank LaTeX dataset, as originally done by the authors [17]. Similarly, for Word, we choose 1500 images randomly from the TableBank Word dataset for training and, again, randomly choose 1000 images from the TableBank Word dataset to create the test set. We found that some of the annotations provided for the TableBank Word dataset images were inappropriate and preferred not to include those images in the test set. Finally, we create a combined Word+LaTeX set by merging the randomly chosen Word and LaTeX train sets, giving a total of 3000 training images, and likewise create its test set by combining the randomly chosen LaTeX and Word test images, giving a total of 2000 images.
We also evaluate the model on the ICDAR 13 dataset [11], which includes a total of 150 tables. It has two subsets, EU and US: there are 75 tables in 27 PDFs from the EU set and 75 tables in 40 PDFs from the US Government set. We convert all of these PDFs into images, obtaining 238 images, out of which we use 40 randomly chosen images for fine-tuning and the rest for testing.
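A small sketch of this PDF-to-image conversion, using the pdf2image package as one possible tool; the DPI value and directory layout are assumptions.

```python
from pathlib import Path
from pdf2image import convert_from_path  # poppler-based PDF rasterizer

def pdfs_to_images(pdf_dir: str, out_dir: str, dpi: int = 200) -> None:
    """Render every page of every PDF in pdf_dir to a PNG in out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in Path(pdf_dir).glob('*.pdf'):
        for page_no, page in enumerate(convert_from_path(str(pdf_path), dpi=dpi)):
            page.save(f'{out_dir}/{pdf_path.stem}_page{page_no}.png')
```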
For creating a dataset for the table structure recognition task, we manually annotated some images from the ICDAR 19 (Track A Modern) train set. As discussed earlier, this dataset is annotated for three classes. We randomly chose 342 images out of the 600 images of the ICDAR 19 train set. These images contained 114 bordered tables, 429 borderless tables and 24,920 cells in borderless tables, and they were annotated accordingly. We release this dataset to the research community. The test set for table structure recognition was provided by cTDaR competition Track B2. It contains 100 images of all types of documents and tables.
5. Results and Analysis
In this section, we start by demonstrating the effectiveness of the image transformation techniques by performing experiments with a baseline model. Then we show a comparative analysis of various CNN models against the Cascade mask R-CNN HRNet. Finally, we show the evaluation benchmarks of our model on public datasets. The experiments were performed on the Google Colaboratory platform with a P100 PCIE GPU with 16 GB of GPU memory, an Intel(R) Xeon(R) CPU @ 2.30GHz and 12.72 GB of RAM.
5.1. Preliminary Analysis
To show the effectiveness of the proposed image transformation techniques, we train a baseline model on all four datasets (created by augmenting the general dataset in Section 4) and evaluate the results on the ICDAR 19 Modern Track A test set. We try to find, among the four datasets, the one that helps the model perform best. We chose the Faster R-CNN resnext101 64x4d (cardinality = 64 and bottleneck width = 4) model as the baseline model. The MMDetection toolbox was used to implement the model with the default training configurations provided by the framework.
Evaluation metrics for the ICDAR 19 dataset are based on IoU (Intersection over Union) to evaluate the performance of table region detection. Precision, recall and F1 scores are calculated at IoU thresholds of 0.6, 0.7, 0.8 and 0.9 respectively. The weighted-average F1 (WAvg.) is calculated by assigning to each F1 value a weight equal to its corresponding IoU threshold. As a result, F1 scores at higher IoUs are given more importance than those at lower IoUs. The details of the metric are further explained by Gao et al. [8]. Table 1 shows the F1 scores at these IoU thresholds for the baseline models on the ICDAR test set (Track A Modern). The model trained on the dataset containing images from both augmentation techniques performs significantly better than the models trained on the other datasets.
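A minimal sketch of the weighted-average F1 described above, with the IoU thresholds themselves used as weights.

```python
def weighted_average_f1(f1_scores, thresholds=(0.6, 0.7, 0.8, 0.9)):
    """Weighted-average F1 for ICDAR 2019 Track A: each F1 score is weighted
    by its IoU threshold, so higher-IoU results count more. `f1_scores` must
    be ordered to match `thresholds`."""
    weighted = sum(t * f for t, f in zip(thresholds, f1_scores))
    return weighted / sum(thresholds)
```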
These results proved that both image transformation techniques for data augmentation help the model learn more effectively. So, we use both image transformation techniques on our General dataset for further experiments on the table detection task.
To compare the CascadeTabNet model with other Cascade R-CNN and HRNet based object detection and instance segmentation models, we use the general dataset with both augmentation techniques for training. We use MMDetection-based implementations of all the models with their default configurations. All of these models have backbones pre-trained on the ImageNet dataset and use training schedules of 1x (12 epochs) and 2x (24 epochs), as further described in [4]. All models use a Feature Pyramid Network (FPN) neck. We fine-tuned the following object detection and instance segmentation models:
1. Retina: ResNeXt-101 based RetinaNet model with cardinality = 32 and bottleneck width = 4d.
2. FRcnnHr: Faster R-CNN with an hrnetv2p w40 backbone (40 indicates the width of the high-resolution convolution).
3. CRcnnX: Three-staged Cascade R-CNN with a ResNeXt-101 backbone having cardinality = 64 and bottleneck width = 4d.
4. CRcnnHr: Three-staged Cascade R-CNN with an hrnetv2p w32 backbone.
5. CMRcnnD: Three-staged Cascade R-CNN with a ResNet-50 backbone with c3-c5 (adding deformable convolutions in ResNet stages 3 to 5).
6. CMRcnnX: Three-staged Cascade mask R-CNN with a ResNeXt-101 backbone having cardinality = 64 and bottleneck width = 4d.
7. CMRcnnHr: Three-staged Cascade mask R-CNN with an hrnetv2p w32 backbone.
Table 2 shows the evaluated F1 scores of all the models on the ICDAR test (Track A Modern) set. As seen in the table, the multi-stage cascaded network methodology together with HRNet backbone based models dominates the other models, and the instance segmentation models do better than the pure object detection models. The Cascade mask R-CNN HRNet model achieves the highest accuracy among all models because it fuses the two methodologies of multi-staged cascading and high-resolution convolutions for instance segmentation.
5.2. Table detection evaluation
We again apply the iterative transfer learning technique to fine-tune our general model (Cascade mask R-CNN HRNet) on the ICDAR 13, ICDAR 19 and TableBank datasets respectively for evaluation.
First, we fine-tune the Cascade mask R-CNN HRNet on the ICDAR 19 Track A train set along with dilation transform augmentation, and the following results were obtained on the Track A Modern test set. We achieved 3rd rank on the post-competition leaderboard according to the weighted-average metric, but attained the best accuracy for IoU 0.9 (Table 3). The winner of the competition, TableRadar, performs two types of post-processing on the original output of the network: they merge regions whose overlapping areas are larger than a defined threshold, and they detect lines in candidate table regions such that, if a detected line extends beyond the table border, the table region is extended accordingly.
The runner-up, NLPR-PAL, used a Fully Convolutional Network (FCN) to classify image pixels into two categories, table and background, after which table regions are extracted with Connected Component Analysis (CCA). Further details about both of these approaches are described in [8]. The advantage of our approach over those of the winner and the runner-up is that both of them involve some kind of post-processing after the original output of the network, whereas our approach does not perform any post-processing. Our model directly outputs accurate table region masks, leveraging its architectural design and the techniques applied during its training.
Evaluation metrics for the TableBank table detection dataset are based on calculating precision, recall and F1 in the same way as in [9], where the metrics for all documents are computed by summing up the areas of overlap, prediction, and ground truth. At this point, we want to emphasize that we only use 1500 images from Word, 1500 from LaTeX and 3000 images for the Word+LaTeX dataset for training (fine-tuning) the models. We achieve the best accuracy results for all three subsets (Table 4).
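A hedged sketch of this area-based metric for one document; matching every prediction against every ground-truth box is a simplifying assumption and would need refinement when boxes overlap multiple targets.

```python
def area(box):
    # box = (x1, y1, x2, y2)
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])

def overlap_area(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0, x2 - x1) * max(0, y2 - y1)

def area_precision_recall(predictions, ground_truths):
    """Precision and recall as ratios of summed overlap area to summed
    prediction area and summed ground-truth area, respectively."""
    overlap = sum(overlap_area(p, g) for p in predictions for g in ground_truths)
    precision = overlap / max(sum(area(p) for p in predictions), 1e-9)
    recall = overlap / max(sum(area(g) for g in ground_truths), 1e-9)
    return precision, recall
```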
The evaluation metric for ICDAR 2013 is based on the completeness and purity of the sub-objects of a table. We calculate precision and recall for each table and then take the average, as done in [18]. The metric is further described in [18], [10] and [24]. We use only 40 images from the dataset for fine-tuning the general model and 198 images for testing, while [18] and [21] used only 34 images for testing and the rest of the dataset for training. Results are shown in Table 5.
5.3. Table structure recognition evaluation
We trained the general model on our annotated dataset, and this model is included in the final pipeline. The results are evaluated on the ICDAR 19 Track B2 dataset. The evaluation for this track is done by comparing the structure of a table, which is defined as a matrix of cells. For each cell, it is required to return the coordinates of a polygon defining the convex hull of the cell's contents, along with the start/end column/row information for each cell. The evaluation uses a cell adjacency relation-based table structure comparison (based on Gobel et al. [10]). Similar to Track A, precision, recall and F1 scores are calculated at IoU thresholds of 0.6, 0.7, 0.8 and 0.9 respectively. We attain the highest accuracy on the post-competition leaderboard (Table 6), although some additional high-end post-processing could improve the results significantly.
Figure 5 shows the results of our model. It predicts yellow masks for bordered tables (5 a.) and purple masks for borderless tables (5 b.). It predicts accurate cell masks for most of the borderless tables. For images in which some of the cell predictions are missed by the model (5 c.), we correct the output using the line estimation and contour-based text detection algorithms. The model fails badly for some images (5 d.).
6. Conclusion
This paper presented an end-to-end system for table detection and structure recognition. It is shown that existing instance segmentation based CNN architectures, originally trained for objects in natural scene images, are also very effective for detecting tables, and that iterative transfer learning and image augmentation techniques can be used to learn efficiently from a small amount of data. The proposed model recognizes structures within tables by predicting table cell masks while also using line information. Improving the post-processing modules could further enhance the accuracy. Our system performs better on various public datasets for both tasks. We thank Akshay Navalakha (AP Analytica) for his idea and guidance in the initial invoice-document parsing project that we developed for him.