Table extraction from text images with deep learning and OpenCV

This post describes how to extract table regions from document images using Mask R-CNN and morphological operations. First, a Mask R-CNN network is trained to detect table regions; then morphological opening is used to extract the line segments inside them. The post walks through how R-CNN and Mask R-CNN work and discusses the pros and cons of two line-detection approaches.

Extracting table regions from text images with Mask R-CNN and morphological operations

Train a segmentation network to detect table regions; once the network outputs the table regions of a document image, use the morphological opening operation to extract the line segments inside them for later processing.

Mask R-CNN:

As the name suggests, Mask R-CNN consists of two parts: a mask branch and an R-CNN. It adds a mask branch on top of the original R-CNN family and performs object detection and instance segmentation at the same time. In my opinion it is an architecture well worth studying in depth.

R-CNN:

The R-CNN family (in particular Faster R-CNN) is the classic two-stage object detection approach and reaches state-of-the-art (SOTA) results on many detection datasets. This post only gives a high-level overview; more details can be found via Google.

Stage 1:
Use the RPN to generate region proposals.
Stage 2:
Run bounding-box regression and classification on the stage-1 proposals to produce the final detections.

[Figure: Faster R-CNN architecture]
As shown in the figure above, the input image first goes through shared convolutional layers to produce feature maps. One branch feeds them into the RPN to generate region proposals; the other branch combines the RPN output with the feature maps to obtain ROIs, which a pooling layer turns into fixed-size feature vectors for the final bounding-box regression and classification.
[Figure: RPN network]
The RPN takes the feature maps and applies a 3×3 convolution at every position, with channel = 512 when the preceding backbone is VGG-16. Each position generates 9 anchors: 3 scales × 3 aspect ratios. Two 1×1 convolutions then feed a box-regression branch and a classification branch: the regression branch refines the anchors, and the classification branch decides whether an anchor is positive, i.e. whether it contains an object.
In theory the ratio of positive to negative anchors is 1:1, with 256 in total; if one side has too few samples it is padded so that the classes stay balanced. In theory the 128 positive anchors correspond to 128 objects in the image; NMS is then applied to merge and deduplicate the generated bounding boxes.
Note that the RPN does not output coordinates directly but offsets relative to the anchors; only after adding these offsets to the anchor coordinates and applying NMS do we get meaningful object coordinates. Also, the RPN output maps directly to the original image size rather than being aligned with the feature maps.
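As a small illustration (my own sketch, not code from any particular repository; names such as generate_anchors are made up), this is how the 9 anchors per feature-map location, i.e. 3 scales × 3 aspect ratios, can be enumerated and mapped back to input-image coordinates through the feature stride:

import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * 9, 4) anchors as (x1, y1, x2, y2) in image coordinates."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # centre of this feature-map cell, projected back onto the input image
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)   # same area s*s, aspect ratio r
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

# e.g. a 38x50 VGG-16 feature map (stride 16) yields 38 * 50 * 9 = 17100 anchors
print(generate_anchors(38, 50).shape)   # (17100, 4)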

Stage 2
The RPN output is combined with the feature maps: adding the RPN offsets to the anchors gives bounding boxes, which are located on the feature maps to obtain ROIs. A max pooling or average pooling step then produces fixed-size feature maps, which are fed into two FC branches for bounding-box regression and softmax multi-class classification.
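To make the "offsets rather than coordinates" point concrete, here is a minimal numpy sketch (again my own illustration, using the common (dx, dy, dw, dh) parameterisation rather than the exact code of any specific implementation) that turns anchors plus predicted deltas into proposal boxes:

import numpy as np

def apply_deltas(anchors, deltas):
    """anchors and deltas are (N, 4); anchors are (x1, y1, x2, y2), deltas are (dx, dy, dw, dh)."""
    w  = anchors[:, 2] - anchors[:, 0]
    h  = anchors[:, 3] - anchors[:, 1]
    cx = anchors[:, 0] + 0.5 * w
    cy = anchors[:, 1] + 0.5 * h

    # shift the centre and rescale the size as predicted by the network
    cx = cx + deltas[:, 0] * w
    cy = cy + deltas[:, 1] * h
    w  = w * np.exp(deltas[:, 2])
    h  = h * np.exp(deltas[:, 3])

    return np.stack([cx - 0.5 * w, cy - 0.5 * h,
                     cx + 0.5 * w, cy + 0.5 * h], axis=1)

NMS is then applied to these decoded boxes before they are cropped from the feature maps as ROIs.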

So Faster R-CNN has four loss functions to optimise: two in the RPN and two after the final FC layers.

More details on Faster R-CNN can be found in this article:
一文读懂Faster RCNN
and in the original paper:
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
More detailed material can be found via Google.

Mask

Some familiarity with the U-Net semantic segmentation network is assumed.
Part 1
The U-Net semantic segmentation network. U-Net can be seen as a fully convolutional network (FCN) combined with an FPN-style (feature pyramid) structure.

[Figure: U-Net architecture]
Before that, a few concepts need to be distinguished:

1. image classification

2. object detection

3. semantic segmentation

4. instance segmentation

1. image classification
Given an image, output a single label, under the assumption that the image contains exactly one object. This is the most basic branch of computer vision.

2. object detection
The image contains objects of several classes; we need both their labels and their exact locations in the image, i.e. what and where.

3. semantic segmentation
Output the exact label and location for every pixel in the image; a per-pixel classification task.

4. instance segmentation
One step beyond 3: individuals of the same class are separated. For example, if an image contains three instances of the class "car", semantic segmentation marks them all with the same "car" identifier, whereas instance segmentation separates them with different identifiers.

The figure below may help.

[Figure: image classification vs. object detection vs. semantic segmentation vs. instance segmentation]
Back to U-Net, which is a semantic segmentation network. The network extracts features through 4 down-sampling steps to determine what is in the image (what); to recover the location information (where), it then performs 4 up-sampling steps that bring the feature maps back to the original image size, and to exploit low-level features each up-sampled feature map is fused (add or concatenate) with the corresponding low-level feature map.
The figure below unrolls the U-Net structure in detail:

[Figure: U-Net unrolled]
The whole process can be summarised as:

Input (128x128x1) => Encoder => (8x8x256) => Decoder => Output (128x128x1)

Keras implementation:

from keras.models import Model
from keras.layers import (Conv2D, Conv2DTranspose, MaxPooling2D, Dropout,
                          BatchNormalization, Activation, concatenate)

def conv2d_block(input_tensor, n_filters, kernel_size = 3, batchnorm = True):
    """Function to add 2 convolutional layers with the parameters passed to it"""
    # first layer
    x = Conv2D(filters = n_filters, kernel_size = (kernel_size, kernel_size),\
              kernel_initializer = 'he_normal', padding = 'same')(input_tensor)
    if batchnorm:
        x = BatchNormalization()(x)
    x = Activation('relu')(x)
    
    # second layer
    x = Conv2D(filters = n_filters, kernel_size = (kernel_size, kernel_size),\
              kernel_initializer = 'he_normal', padding = 'same')(x)
    if batchnorm:
        x = BatchNormalization()(x)
    x = Activation('relu')(x)
    
    return x
  
def get_unet(input_img, n_filters = 16, dropout = 0.1, batchnorm = True):
    # Contracting Path
    c1 = conv2d_block(input_img, n_filters * 1, kernel_size = 3, batchnorm = batchnorm)
    p1 = MaxPooling2D((2, 2))(c1)
    p1 = Dropout(dropout)(p1)
    
    c2 = conv2d_block(p1, n_filters * 2, kernel_size = 3, batchnorm = batchnorm)
    p2 = MaxPooling2D((2, 2))(c2)
    p2 = Dropout(dropout)(p2)
    
    c3 = conv2d_block(p2, n_filters * 4, kernel_size = 3, batchnorm = batchnorm)
    p3 = MaxPooling2D((2, 2))(c3)
    p3 = Dropout(dropout)(p3)
    
    c4 = conv2d_block(p3, n_filters * 8, kernel_size = 3, batchnorm = batchnorm)
    p4 = MaxPooling2D((2, 2))(c4)
    p4 = Dropout(dropout)(p4)
    
    c5 = conv2d_block(p4, n_filters = n_filters * 16, kernel_size = 3, batchnorm = batchnorm)
    
    # Expansive Path
    u6 = Conv2DTranspose(n_filters * 8, (3, 3), strides = (2, 2), padding = 'same')(c5)
    u6 = concatenate([u6, c4])
    u6 = Dropout(dropout)(u6)
    c6 = conv2d_block(u6, n_filters * 8, kernel_size = 3, batchnorm = batchnorm)
    
    u7 = Conv2DTranspose(n_filters * 4, (3, 3), strides = (2, 2), padding = 'same')(c6)
    u7 = concatenate([u7, c3])
    u7 = Dropout(dropout)(u7)
    c7 = conv2d_block(u7, n_filters * 4, kernel_size = 3, batchnorm = batchnorm)
    
    u8 = Conv2DTranspose(n_filters * 2, (3, 3), strides = (2, 2), padding = 'same')(c7)
    u8 = concatenate([u8, c2])
    u8 = Dropout(dropout)(u8)
    c8 = conv2d_block(u8, n_filters * 2, kernel_size = 3, batchnorm = batchnorm)
    
    u9 = Conv2DTranspose(n_filters * 1, (3, 3), strides = (2, 2), padding = 'same')(c8)
    u9 = concatenate([u9, c1])
    u9 = Dropout(dropout)(u9)
    c9 = conv2d_block(u9, n_filters * 1, kernel_size = 3, batchnorm = batchnorm)
    
    outputs = Conv2D(1, (1, 1), activation='sigmoid')(c9)  # use softmax for multi-class segmentation, sigmoid for binary segmentation
    model = Model(inputs=[input_img], outputs=[outputs])
    return model
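A minimal usage sketch (my own, assuming the 128×128×1 input from the summary above):

from keras.layers import Input

input_img = Input((128, 128, 1), name='img')   # 128x128 grayscale input
model = get_unet(input_img, n_filters=16, dropout=0.1, batchnorm=True)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()   # encoder: 128 -> 8, decoder: 8 -> 128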

More details on U-Net can be found in the article:
Understanding Semantic Segmentation with UNET
or via Google.

Mask R-CNN

First, the architecture diagram:
[Figure: Mask R-CNN architecture]
Compared with Faster R-CNN there are three main changes: an FPN structure is introduced, ROI Align replaces the original ROI pooling, and a mask branch is added.

FPN
The idea and implementation of the FPN are essentially the same as in U-Net. The FPN sits on top of the backbone network that extracts the feature maps; the backbone can be ResNet-50, ResNet-101, and so on.

Both ResNet-50 and ResNet-101 are built from two kinds of blocks: the identity_block, whose shortcut needs no convolution, and the conv_block, whose shortcut does. The criterion is whether the block down-samples: with down-sampling a conv_block is needed, without it an identity_block is enough. Typically stride = 1 means an identity_block and stride = 2 a conv_block.
Keras implementation:

identity_block (taken from the open-source Mask R-CNN implementation, where `KL` is an alias for `keras.layers` and `BatchNorm` is that project's thin wrapper around `BatchNormalization`):

def identity_block(input_tensor, kernel_size, filters, stage, block,
				   use_bias=True, train_bn=True):
	"""The identity_block is the block that has no conv layer at shortcut
	# Arguments
		input_tensor: input tensor
		kernel_size: default 3, the kernel size of middle conv layer at main path
		filters: list of integers, the nb_filters of 3 conv layer at main path
		stage: integer, current stage label, used for generating layer names
		block: 'a','b'..., current block label, used for generating layer names
		use_bias: Boolean. To use or not use a bias in conv layers.
		train_bn: Boolean. Train or freeze Batch Norm layers
	"""
	nb_filter1, nb_filter2, nb_filter3 = filters
	conv_name_base = 'res' + str(stage) + block + '_branch'
	bn_name_base = 'bn' + str(stage) + block + '_branch'

	x = KL.Conv2D(nb_filter1, (1, 1), name=conv_name_base + '2a',
				  use_bias=use_bias)(input_tensor)
	x = BatchNorm(name=bn_name_base + '2a')(x, training=train_bn)
	x = KL.Activation('relu')(x)

	x = KL.Conv2D(nb_filter2, (kernel_size, kernel_size), padding='same',
				  name=conv_name_base + '2b', use_bias=use_bias)(x)
	x = BatchNorm(name=bn_name_base + '2b')(x, training=train_bn)
	x = KL.Activation('relu')(x)

	x = KL.Conv2D(nb_filter3, (1, 1), name=conv_name_base + '2c',
				  use_bias=use_bias)(x)
	x = BatchNorm(name=bn_name_base + '2c')(x, training=train_bn)

	x = KL.Add()([x, input_tensor])
	x = KL.Activation('relu', name='res' + str(stage) + block + '_out')(x)
	return x

conv_block

def conv_block(input_tensor, kernel_size, filters, stage, block,
			   strides=(2, 2), use_bias=True, train_bn=True):
	"""conv_block is the block that has a conv layer at shortcut
	# Arguments
		input_tensor: input tensor
		kernel_size: default 3, the kernel size of middle conv layer at main path
		filters: list of integers, the nb_filters of 3 conv layer at main path
		stage: integer, current stage label, used for generating layer names
		block: 'a','b'..., current block label, used for generating layer names
		use_bias: Boolean. To use or not use a bias in conv layers.
		train_bn: Boolean. Train or freeze Batch Norm layers
	Note that from stage 3, the first conv layer at main path is with subsample=(2,2)
	And the shortcut should have subsample=(2,2) as well
	"""
	nb_filter1, nb_filter2, nb_filter3 = filters
	conv_name_base = 'res' + str(stage) + block + '_branch'
	bn_name_base = 'bn' + str(stage) + block + '_branch'

	x = KL.Conv2D(nb_filter1, (1, 1), strides=strides,
				  name=conv_name_base + '2a', use_bias=use_bias)(input_tensor)
	x = BatchNorm(name=bn_name_base + '2a')(x, training=train_bn)
	x = KL.Activation('relu')(x)

	x = KL.Conv2D(nb_filter2, (kernel_size, kernel_size), padding='same',
				  name=conv_name_base + '2b', use_bias=use_bias)(x)
	x = BatchNorm(name=bn_name_base + '2b')(x, training=train_bn)
	x = KL.Activation('relu')(x)

	x = KL.Conv2D(nb_filter3, (1, 1), name=conv_name_base +
				  '2c', use_bias=use_bias)(x)
	x = BatchNorm(name=bn_name_base + '2c')(x, training=train_bn)

	shortcut = KL.Conv2D(nb_filter3, (1, 1), strides=strides,
						 name=conv_name_base + '1', use_bias=use_bias)(input_tensor)
	shortcut = BatchNorm(name=bn_name_base + '1')(shortcut, training=train_bn)

	x = KL.Add()([x, shortcut])
	x = KL.Activation('relu', name='res' + str(stage) + block + '_out')(x)
	return x

resnet-FPN

def resnet_graph(input_image, architecture, stage5=False, train_bn=True):
	"""Build a ResNet graph.
		architecture: Can be resnet50 or resnet101
		stage5: Boolean. If False, stage5 of the network is not created
		train_bn: Boolean. Train or freeze Batch Norm layers
	"""
	assert architecture in ["resnet50", "resnet101"]
	# Stage 1
	x = KL.ZeroPadding2D((3, 3))(input_image)
	x = KL.Conv2D(64, (7, 7), strides=(2, 2), name='conv1', use_bias=True)(x)
	x = BatchNorm(name='bn_conv1')(x, training=train_bn)
	x = KL.Activation('relu')(x)
	C1 = x = KL.MaxPooling2D((3, 3), strides=(2, 2), padding="same")(x)
	# Stage 2
	x = conv_block(x, 3, [64, 64, 256], stage=2, block='a', strides=(1, 1), train_bn=train_bn)
	x = identity_block(x, 3, [64, 64, 256], stage=2, block='b', train_bn=train_bn)
	C2 = x = identity_block(x, 3, [64, 64, 256], stage=2, block='c', train_bn=train_bn)
	# Stage 3
	x = conv_block(x, 3, [128, 128, 512], stage=3, block='a', train_bn=train_bn)
	x = identity_block(x, 3, [128, 128, 512], stage=3, block='b', train_bn=train_bn)
	x = identity_block(x, 3, [128, 128, 512], stage=3, block='c', train_bn=train_bn)
	C3 = x = identity_block(x, 3, [128, 128, 512], stage=3, block='d', train_bn=train_bn)
	# Stage 4
	x = conv_block(x, 3, [256, 256, 1024], stage=4, block='a', train_bn=train_bn)
	block_count = {"resnet50": 5, "resnet101": 22}[architecture]
	for i in range(block_count):
		x = identity_block(x, 3, [256, 256, 1024], stage=4, block=chr(98 + i), train_bn=train_bn)
	C4 = x
	# Stage 5
	if stage5:
		x = conv_block(x, 3, [512, 512, 2048], stage=5, block='a', train_bn=train_bn)
		x = identity_block(x, 3, [512, 512, 2048], stage=5, block='b', train_bn=train_bn)
		C5 = x = identity_block(x, 3, [512, 512, 2048], stage=5, block='c', train_bn=train_bn)
	else:
		C5 = None
	return [C1, C2, C3, C4, C5]

_,C2,C3,C4,C5 = resnet_graph(******) ## set the arguments according to your own setup
## C1 is not used, to keep memory / GPU memory usage down
config.TOP_DOWN_PYRAMID_SIZE = 256
P5 = KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c5p5')(C5)
P4 = KL.Add(name="fpn_p4add")([
			KL.UpSampling2D(size=(2, 2), name="fpn_p5upsampled")(P5),
			KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c4p4')(C4)])
P3 = KL.Add(name="fpn_p3add")([
			KL.UpSampling2D(size=(2, 2), name="fpn_p4upsampled")(P4),
			KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c3p3')(C3)])
P2 = KL.Add(name="fpn_p2add")([
			KL.UpSampling2D(size=(2, 2), name="fpn_p3upsampled")(P3),
			KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c2p2')(C2)])
# Attach 3x3 conv to all P layers to get the final feature maps.
P2 = KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p2")(P2)
P3 = KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p3")(P3)
P4 = KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p4")(P4)
P5 = KL.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p5")(P5)
# P6 is used for the 5th anchor scale in RPN. Generated by
# subsampling from P5 with stride of 2.
P6 = KL.MaxPooling2D(pool_size=(1, 1), strides=2, name="fpn_p6")(P5)
# Note that P6 is used in RPN, but not in the classifier heads.
rpn_feature_maps = [P2, P3, P4, P5, P6]
mrcnn_feature_maps = [P2, P3, P4, P5]

ROI Align

The original ROI pooling rounds twice: the four offsets output by the RPN are floating point, so rounding them necessarily introduces error, and rounding the cell boundaries during pooling shifts and misaligns the bounding boxes even further. ROI Align removes the rounding and instead uses bilinear interpolation to compute feature values at floating-point coordinates.
The figure below gives a simple comparison of the two approaches:

[Figure: ROI pooling vs. ROI Align]
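The core of ROI Align is sampling a feature map at fractional coordinates instead of rounding them. A minimal bilinear-interpolation sketch (my own illustration, not the actual ROI Align kernel):

import numpy as np

def bilinear_sample(feat, x, y):
    """Sample a 2-D feature map `feat` at the fractional location (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, feat.shape[1] - 1), min(y0 + 1, feat.shape[0] - 1)
    wx, wy = x - x0, y - y0
    # weighted average of the four surrounding integer grid points
    return (feat[y0, x0] * (1 - wx) * (1 - wy) + feat[y0, x1] * wx * (1 - wy) +
            feat[y1, x0] * (1 - wx) * wy      + feat[y1, x1] * wx * wy)

feat = np.arange(16, dtype=np.float32).reshape(4, 4)
print(bilinear_sample(feat, 1.5, 2.25))   # 10.5, the value at a non-integer coordinate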
One question: from which of the [P2, P3, P4, P5] feature maps should ROI Align crop a given ROI?

Assuming the training images have w = h = 224 and k0 = 4, the target level $P_k$ is given by:

$k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$

Here 224 should be adjusted to the actual width and height of the training images, and w, h in the formula are the width and height of the ROI. From P2 to P5 the resolution decreases, so a large ROI (a large object) is detected on a low-resolution feature map and a small ROI (a small object) on a high-resolution one, which is quite reasonable.
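Expressed in code (a small sketch of the level-assignment rule above, using 224 and k0 = 4 as in the example):

import math

def fpn_level(roi_w, roi_h, k0=4, canonical_size=224):
    """Map an ROI of size (roi_w, roi_h) to a pyramid level P2..P5."""
    k = math.floor(k0 + math.log2(math.sqrt(roi_w * roi_h) / canonical_size))
    return min(5, max(2, k))        # clamp to the available levels

print(fpn_level(224, 224))  # 4 -> P4
print(fpn_level(56, 56))    # 2 -> P2 (small object, high-resolution map)
print(fpn_level(448, 448))  # 5 -> P5 (large object, low-resolution map)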

More details on ROI Align can be found in this article:

令人拍案称奇的Mask RCNN

Mask branch

Compare with U-Net: if the Faster R-CNN branches are removed, what remains is essentially a variant of U-Net, except that ROI Align is applied on top of the FPN structure. The mask branch runs convolutions on the ROI-Aligned feature maps to produce the final masks; the masks need not be at the original image size and can be mapped back to it afterwards.
Reference code (a fragment of the mask-head graph; `x` is the ROI-Aligned feature tensor):

x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
                       name="mrcnn_mask_conv1")(x)
x = KL.TimeDistributed(BatchNorm(),
                       name='mrcnn_mask_bn1')(x, training=train_bn)
x = KL.Activation('relu')(x)

x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
                       name="mrcnn_mask_conv2")(x)
x = KL.TimeDistributed(BatchNorm(),
                       name='mrcnn_mask_bn2')(x, training=train_bn)
x = KL.Activation('relu')(x)

x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
                       name="mrcnn_mask_conv3")(x)
x = KL.TimeDistributed(BatchNorm(),
                       name='mrcnn_mask_bn3')(x, training=train_bn)
x = KL.Activation('relu')(x)

x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
                       name="mrcnn_mask_conv4")(x)
x = KL.TimeDistributed(BatchNorm(),
                       name='mrcnn_mask_bn4')(x, training=train_bn)
x = KL.Activation('relu')(x)

x = KL.TimeDistributed(KL.Conv2DTranspose(256, (2, 2), strides=2, activation="relu"),
                       name="mrcnn_mask_deconv")(x)
x = KL.TimeDistributed(KL.Conv2D(num_classes, (1, 1), strides=1, activation="sigmoid"),
                       name="mrcnn_mask")(x)

After four ordinary convolutions with channel = 256 and kernel size 3, one transposed convolution that up-samples by 2, and finally a kernel-size-1 convolution with a sigmoid activation, we obtain a mask twice the size of the input feature map.

More details on Mask R-CNN:

Image segmentation with Mask R-CNN
令人拍案称奇的Mask RCNN
or via Google.

The following part is about training Mask R-CNN on a table dataset for table detection.

There are plenty of resources for this; I recommend the open-source Mask R-CNN project on GitHub:

Mask R-CNN

The project ships several examples; follow their structure to write the code for your own dataset.
For table data I recommend TableBank, which is open-sourced on GitHub; you need to submit a request, and the reply email contains the download link.
TableBank
The original dataset only annotates the table regions; I annotated some additional figure and image samples myself for other needs.

Step 1
1. Create a new directory under samples/: table_image_figure
2. In that directory create a Python file: table_image_figure.py
3. Copy samples/balloon/balloon.py into table_image_figure.py
4. Adapt table_image_figure.py to your own dataset and annotation format.

First, the configuration class:

class TableImageFigureConfig(Config):
    """Configuration for training on the toy  dataset.
    Derives from the base Config class and overrides some values.
    """
    # Give the configuration a recognizable name
    NAME = "table_image_figure"
    GPU_COUNT = 1  # Do not use more than one GPU, otherwise training may hang.
    # We use a GPU with 12GB memory, which can fit two images.
    # Adjust down if you use a smaller GPU.
    IMAGES_PER_GPU = 2

    # Number of classes (including background): the number of classes in the training data plus the background class
    NUM_CLASSES = 1 + 3  # Background + table + image + figure

    # Number of training steps per epoch; set this according to your data
    STEPS_PER_EPOCH = 500

    # Skip detections with < 90% confidence: pixels with a score < 0.9 are treated as background
    DETECTION_MIN_CONFIDENCE = 0.9

Then define how the data is read, based on the structure of your training samples:

class Table_image_figure_Dataset(utils.Dataset):

    def load_table(self, dataset_dir, subset):
        """Load a subset of your  dataset.
        dataset_dir: Root directory of the dataset.
        subset: Subset to load: train or val
        """
        # Add classes. We have three classes to add; adjust to your own dataset.
        self.add_class("table_image_figure", 1, "Table")
        self.add_class("table_image_figure", 2, "Image")
        self.add_class("table_image_figure", 3, "Figure")
        # Train or validation dataset? The training data contains two sub-directories, train and val;
        # the images and labels are read from each of them in turn.
        assert subset in ["train", "val"]

        dataset_dir = os.path.join(dataset_dir, subset)
        images = glob.glob(dataset_dir+'/*.jpg')
        name_dict = {"Table":1,"Image":2,"Figure":3}  # adjust to your own dataset; my label files are XML, other formats just need a matching parser.
        for image in images:
            name_id = []
            image_name = image.split('/')[-1]
            image_path = image
            img = skimage.io.imread(image_path)
            annotation = image_name[:-4] + '.xml'
            height, width = img.shape[:2]
            polygons = []
            in_file = open(os.path.join(dataset_dir,annotation))
            tree = ET.parse(in_file)
            root = tree.getroot()
            for obj in root.iter('object'):
                current = list()
                name = obj.find('name').text
                xmlbox = obj.find('bndbox')
                x_min = int(xmlbox.find('xmin').text)
                y_min = int(xmlbox.find('ymin').text)
                x_max = int(xmlbox.find('xmax').text)
                y_max = int(xmlbox.find('ymax').text)
                name_id.append(name_dict[name])
                polygons.append({'x':[x_min,x_max,x_max,x_min],'y':[y_min,y_min,y_max,y_max]})
            self.add_image(
                "table_image_figure",
                image_id=image_name,  # use file name as a unique image id
                path=image_path,
                class_id=name_id,
                width=width, height=height,
                polygons=polygons)

    def load_mask(self, image_id):
        """Generate instance masks for an image.
       Returns:
        masks: A bool array of shape [height, width, instance count] with
            one mask per instance. The number of mask channels equals the number of labelled instances in the image; even if 5 instances share the same class, the mask still has 5 channels.
        class_ids: a 1D array of class IDs of the instance masks.  Modify it.
        """
        # If not your dataset image, delegate to parent class.
        image_info = self.image_info[image_id]
        if image_info["source"] != "table_image_figure":
            return super(self.__class__, self).load_mask(image_id)

        # Convert polygons to a bitmap mask of shape
        # [height, width, instance_count]
        name_id = image_info["class_id"]
        info = self.image_info[image_id]
        mask = np.zeros([info["height"], info["width"], len(info["polygons"])],
                        dtype=np.uint8)  # one mask channel per instance
        class_ids = np.array(name_id,dtype=np.int32)
        for i, p in enumerate(info["polygons"]):
            # Get indexes of pixels inside the polygon and set them to 1
            rr, cc = skimage.draw.polygon(p['y'], p['x'])
            mask[rr, cc, i] = 1

        # Return the mask and the array of class IDs of each instance.
        return mask.astype(np.bool), class_ids
    def image_reference(self, image_id):
        """Return the path of the image."""
        info = self.image_info[image_id]
        if info["source"] == "table_image_figure":
            return info["path"]
        else:
            super(self.__class__, self).image_reference(image_id)

Then adapt the training function:

def train(model):
    """Train the model."""
    # Training dataset.
    dataset_train = Table_image_figure_Dataset()
    dataset_train.load_table(args.dataset, "train")
    dataset_train.prepare()

    # Validation dataset
    dataset_val = Table_image_figure_Dataset()
    dataset_val.load_table(args.dataset, "val")
    dataset_val.prepare()
    # *** This training schedule is an example. Update to your needs ***
    # Since we're using a very small dataset, and starting from
    # COCO trained weights, we don't need to train too long. Also,
    # no need to train all layers, just the heads should do it.
    print("Training network heads")
    model.train(dataset_train, dataset_val,
                learning_rate=config.LEARNING_RATE,
                epochs=30,
                layers='heads')
## replace args.dataset with the path to your own training data.

Note that although a validation dataset is built, it is not used during training, because training can hang when multiple GPUs together with multi-processing/multi-threading are used. If you do not hit this bug, you can train with validation as usual.

Finally, one more change:

if args.command == "train":
    config = TableImageFigureConfig()
else:
    class InferenceConfig(TableImageFigureConfig):
        # Set batch size to 1 since we'll be running inference on
        # one image at a time. Batch size = GPU_COUNT * IMAGES_PER_GPU
        GPU_COUNT = 1
        IMAGES_PER_GPU = 1
    config = InferenceConfig()
config.display()
## Just point this at the config class defined above.

In mrcnn/model.py, around line 2368:

        self.keras_model.fit_generator(
            train_generator,
            initial_epoch=self.epoch,
            epochs=epochs,
            steps_per_epoch=self.config.STEPS_PER_EPOCH,
            callbacks=callbacks,
            workers=1,
            use_multiprocessing=False,
        )
        """
        self.keras_model.fit_generator(
            train_generator,
            initial_epoch=self.epoch,
            epochs=epochs,
            steps_per_epoch=self.config.STEPS_PER_EPOCH,
            callbacks=callbacks,
            validation_data=val_generator,
            validation_steps=self.config.VALIDATION_STEPS,
            max_queue_size=100,
            workers=1,
            use_multiprocessing=False,
        )
        """
## No multi-processing and no validation dataset, workers=1, to avoid training hanging. If you do not hit this bug (https://github.com/matterport/Mask_RCNN/issues/1588), you can use multi-processing and the validation dataset.

Other parameters can be adjusted to your own situation.

I trained for 30 epochs with 500 steps per epoch and batch size 2.
Testing is still in progress.
The complete project will be cleaned up and put on GitHub.

Part 2

Extracting the visible horizontal and vertical lines of a table with OpenCV.

Part 1 located the table regions; this part extracts the visible horizontal and vertical lines inside them. Invisible (unruled) lines are not handled for now.

OpenCV offers two ways to detect lines: the Hough transform, or morphological operations that first recover the line regions and then the line segments.
Hough-based line detection is robust, but it only deals in straight, full-length lines and cannot cope with complex table structures; it needs a lot of post-processing, and designing thresholds that still catch short table lines is not easy.
The morphological approach is more direct: run connected-component analysis on the binary image to get the bounding rectangles of the line regions, then average each rectangle to obtain a single line segment. Its drawbacks are weaker robustness: extraction degrades badly on complex backgrounds, and it is easily disturbed by special formatting such as bold text, heavy fonts and large font sizes. It also needs plenty of post-processing.
In my experiments the morphological method still beats the Hough transform: its post-processing is relatively easy to implement, whereas splitting Hough lines is not so simple (though it can certainly be done with other methods and tricks). I recommend the morphological approach; a small Hough sketch follows for comparison.
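For reference, a minimal probabilistic-Hough sketch (my own, not the pipeline used in this post; "page.jpg" is a hypothetical input) showing why post-processing is needed: it returns raw, possibly fragmented segments that still have to be merged and clipped to the table region:

import cv2
import numpy as np

img = cv2.imread("page.jpg")                 # hypothetical document image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)

# probabilistic Hough transform: returns segments as (x1, y1, x2, y2)
segments = cv2.HoughLinesP(edges, 1, np.pi / 180, 80,
                           minLineLength=50, maxLineGap=5)
if segments is not None:
    for x1, y1, x2, y2 in segments[:, 0]:
        cv2.line(img, (x1, y1), (x2, y2), (0, 0, 255), 2)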

My implementation borrows from the open-source project Camelot: PDF Table Extraction for Humans.

Step 1:
For most document images the background is white and the table lines are dark, so background and table lines are easy to tell apart visually. Binarising the image therefore makes the table-line regions stand out clearly; to cope with different backgrounds, adaptive thresholding is used:

import cv2
import numpy as np

"""
blocksize : int, optional (default: 15)
		Size of a pixel neighborhood that is used to calculate a
		threshold value for the pixel: 3, 5, 7, and so on.

		For more information, refer `OpenCV's adaptiveThreshold <https://docs.opencv.org/2.4/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold>`_.
c : int, optional (default: -2)
		Constant subtracted from the mean or weighted mean.
		Normally, it is positive but may be zero or negative as well.

		For more information, refer `OpenCV's adaptiveThreshold <https://docs.opencv.org/2.4/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold>`_.
"""
blocksize = 15
c = -10
img = cv2.imread(imagename)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
threshold = cv2.adaptiveThreshold(
			np.invert(gray),
			255,
			cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
			cv2.THRESH_BINARY,
			blocksize,
			c,
		)
# np.invert() turns the white background black and the dark line regions white; locating white regions on a black background is more intuitive.
# threshold is the binary image. Both parameters can be tuned based on your own experiments.

Step 2:
Extract the horizontal and vertical lines from the binary image. Morphological opening recovers the line regions; averaging each region's bounding rectangle along its width or height gives a single line segment.

def find_lines(
	threshold, regions=None, direction="horizontal", line_scale=15, iterations=0
):
	"""Finds horizontal and vertical lines by applying morphological
	transformations on an image.

	Parameters
	----------
	threshold : object
		numpy.ndarray representing the thresholded image.
	regions : list, optional (default: None)
		List of page regions that may contain tables of the form x1,y1,x2,y2
		where (x1, y1) -> left-top and (x2, y2) -> right-bottom
		in image coordinate space.
	direction : string, optional (default: 'horizontal')
		Specifies whether to find vertical or horizontal lines.
	line_scale : int, optional (default: 15)
		Factor by which the page dimensions will be divided to get
		smallest length of lines that should be detected.

		The larger this value, smaller the detected lines. Making it
		too large will lead to text being detected as lines.
	iterations : int, optional (default: 0)
		Number of times for erosion/dilation is applied.

		For more information, refer `OpenCV's dilate <https://docs.opencv.org/2.4/modules/imgproc/doc/filtering.html#dilate>`_.

	Returns
	-------
	dmask : object
		numpy.ndarray representing pixels where vertical/horizontal
		lines lie.
	lines : list
		List of tuples representing vertical/horizontal lines with
		coordinates relative to a left-top origin in
		image coordinate space.

	"""
	lines = []
	cont = []
	if direction == "vertical":
		size = threshold.shape[0] // line_scale
		el = cv2.getStructuringElement(cv2.MORPH_RECT, (1, size))
	elif direction == "horizontal":
		size = threshold.shape[1] // line_scale
		el = cv2.getStructuringElement(cv2.MORPH_RECT, (size, 1))
	elif direction is None:
		raise ValueError("Specify direction as either 'vertical' or 'horizontal'")

	if regions is not None:
		region_mask = np.zeros(threshold.shape)
		for region in regions:
			x, y, w, h = region
			region_mask[y : y + h, x : x + w] = 1
		threshold = np.multiply(threshold, region_mask)
	threshold = cv2.erode(threshold, el)
	threshold = cv2.dilate(threshold, el)
	dmask = cv2.dilate(threshold, el, iterations=iterations)
	
	try:
		_, contours, _ = cv2.findContours(
			threshold.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
		)
	except ValueError:
		# for opencv backward compatibility
		contours, _ = cv2.findContours(
			threshold.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
		)

	for c in contours:
		x, y, w, h = cv2.boundingRect(c)
		cont.append((x,y,w,h))
		x1, x2 = x, x + w
		y1, y2 = y, y + h
		if direction == "vertical":
			lines.append(((x1 + x2) // 2, y2, (x1 + x2) // 2, y1))
		elif direction == "horizontal":
			lines.append((x1, (y1 + y2) // 2, x2, (y1 + y2) // 2))
	
	#if direction == "vertical":
		#lines = merge_lines3(lines,"vertical")
	#elif direction == "horizontal":
		#lines = merge_lines4(lines,"horizontal")
	return dmask, lines
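A usage sketch on the binary image produced in step 1:

h_mask, horizontal_lines = find_lines(threshold, direction="horizontal", line_scale=15)
v_mask, vertical_lines = find_lines(threshold, direction="vertical", line_scale=15)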

Step 3:
Post-processing. Large characters and image regions also produce spurious line segments, so the lines have to be filtered using the table regions returned by Mask R-CNN: a segment is kept only if it lies inside a table region. Pseudocode (a more concrete sketch follows below):

kept_lines = []
for line in lines:
    if line in table_region:        # pseudocode: the segment lies inside a table box
        kept_lines.append(line)
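A slightly more concrete version (my own sketch; table_boxes is assumed to be a list of (x1, y1, x2, y2) table boxes from the Mask R-CNN stage, and lines the (x1, y1, x2, y2) tuples returned by find_lines):

def inside(box, line, tol=5):
    """True if both endpoints of `line` fall inside `box`, with a small tolerance."""
    bx1, by1, bx2, by2 = box
    x1, y1, x2, y2 = line
    return (bx1 - tol <= min(x1, x2) and max(x1, x2) <= bx2 + tol and
            by1 - tol <= min(y1, y2) and max(y1, y2) <= by2 + tol)

def filter_lines(lines, table_boxes, tol=5):
    return [l for l in lines if any(inside(b, l, tol) for b in table_boxes)]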

That completes the whole pipeline. Some extra handling is still needed for table lines whose colour is close to the background and for invisible lines.

Results
On the left is the original image; on the right is the table structure extracted and reconstructed after Mask R-CNN localisation and morphological processing. The two are almost indistinguishable.

The code will not be released for now.
