Document Layout Analysis with dhSegment

watersink

Article link: https://blog.csdn.net/qq_14845119/article/details/84859457

Paper: dhSegment: A generic deep-learning approach for document segmentation

GitHub: https://github.com/dhlab-epfl/dhSegment

ICFHR 2018

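The paper builds on a U-Net-style network and applies it to five document-processing tasks: page extraction, baseline extraction, layout analysis, extraction of multiple typologies of illustrations, and photograph extraction. It reports good results on all of them.

Overall pipeline:

The pipeline has two steps.

Step 1: a fully convolutional network (FCN) predicts the mask map.

Step 2: the mask map is post-processed with the following operations.

(1) Thresholding (binarization)

This step uses either a fixed threshold between 0 and 1 or Otsu's method.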

Code:

import cv2
import numpy as np


def thresholding(probs: np.ndarray, threshold: float=-1) -> np.ndarray:
    """
    Computes the binary mask of the detected Page from the probabilities output by network.

    :param probs: array in range [0, 1] of shape HxWx2
    :param threshold: threshold between [0 and 1], if negative Otsu's adaptive threshold will be used
    :return: binary mask
    """

    if threshold < 0:  # Otsu's thresholding
        probs = np.uint8(probs * 255)
        #TODO Correct that weird gaussianBlur
        probs = cv2.GaussianBlur(probs, (5, 5), 0)

        thresh_val, bin_img = cv2.threshold(probs, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        mask = np.uint8(bin_img / 255)
    else:
        mask = np.uint8(probs > threshold)

    return mask
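To make the Otsu branch above less of a black box, here is a dependency-free sketch of the idea behind it: choose the threshold that maximizes the between-class variance of the histogram. The function name `otsu_threshold` and the toy pixel values are illustrative assumptions and are not part of dhSegment. OpenCV's implementation may pick a different value within the same gap between the two modes.

```python
def otsu_threshold(values):
    """Return a threshold t for 8-bit values; pixels with value > t are foreground."""
    hist = [0] * 256
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor 1 / total**2
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


# Toy probability map scaled to 0..255 (assumed values): dark background, bright page.
pixels = [20, 25, 30, 22, 28, 200, 210, 220, 205, 215]
t = otsu_threshold(pixels)
mask = [int(p > t) for p in pixels]
```

On clearly bimodal data like this, any threshold between the two clusters separates them, which is why the adaptive branch needs no hand-tuned value.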

(2) Morphological operations

These include dilation, erosion, opening, and closing.

Code:

def cleaning_binary(mask: np.ndarray, kernel_size: int=5) -> np.ndarray:
    """
    Uses mathematical morphology to clean and remove small elements from binary images.

    :param mask: the binary image to clean
    :param kernel_size: size of the kernel
    :return: the cleaned mask
    """

    ksize_open = (kernel_size, kernel_size)
    ksize_close = (kernel_size, kernel_size)
    mask = cv2.morphologyEx((mask.astype(np.uint8, copy=False) * 255), cv2.MORPH_OPEN, kernel=np.ones(ksize_open))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel=np.ones(ksize_close))
    return mask / 255
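The effect of opening followed by closing can be seen on a tiny mask. The sketch below is a pure-Python illustration, not the OpenCV implementation. The helper names (`erode`, `dilate`, `opening`, `closing`) and the toy mask are assumptions made for this example, and pixels outside the image are simply ignored.

```python
def _morph(mask, k, reducer):
    """Apply min (erosion) or max (dilation) over a k x k window around each pixel."""
    h, w = len(mask), len(mask[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [mask[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = reducer(window)
    return out


def erode(mask, k=3):
    return _morph(mask, k, min)


def dilate(mask, k=3):
    return _morph(mask, k, max)


def opening(mask, k=3):
    # Erosion then dilation: removes specks smaller than the kernel
    return dilate(erode(mask, k), k)


def closing(mask, k=3):
    # Dilation then erosion: fills holes smaller than the kernel
    return erode(dilate(mask, k), k)


# Toy mask (assumed): a 7x7 blob with a one-pixel hole, plus an isolated speck.
h, w = 11, 14
mask = [[0] * w for _ in range(h)]
for y in range(2, 9):
    for x in range(2, 9):
        mask[y][x] = 1
mask[5][5] = 0    # pinhole inside the blob
mask[5][12] = 1   # isolated speck

# Same order as cleaning_binary: open first, then close
cleaned = closing(opening(mask, 3), 3)
```

Opening drops the speck while leaving the blob intact, and closing then fills the pinhole, so the result is a solid 7x7 blob.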

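(3) Connected components analysis

Removes the small connected regions that remain after binarization. The function listed here does this by hysteresis thresholding: a connected component of the low-threshold mask is kept only if it contains at least one pixel above the high threshold.

Code:

import numpy as np
from scipy.ndimage import label


def hysteresis_thresholding(probs: np.ndarray, low_threshold: float, high_threshold: float,
                            candidates_mask: np.ndarray=None) -> np.ndarray:
    """
    Keeps the 8-connected components of `probs > low_threshold` that contain
    at least one pixel with `probs > high_threshold`.
    """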
    low_mask = probs > low_threshold
    if candidates_mask is not None:
        low_mask = candidates_mask & low_mask
    # Connected components extraction
    label_components, count = label(low_mask, np.ones((3, 3)))
    # Keep components with high threshold elements
    good_labels = np.unique(label_components[low_mask & (probs > high_threshold)])
    label_masks = np.zeros((count + 1,), bool)
    label_masks[good_labels] = 1
    return label_masks[label_components]
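The same rule can be written as a flood fill, which may be easier to follow than the label-lookup version above. This is a hedged pure-Python sketch: the function name `hysteresis` and the toy probability grid are assumptions, it assumes `low <= high`, and it omits the optional `candidates_mask`.

```python
from collections import deque


def hysteresis(probs, low, high):
    """Keep pixels > low that are 8-connected to at least one pixel > high."""
    h, w = len(probs), len(probs[0])
    keep = [[False] * w for _ in range(h)]
    # Seed the search with the strong pixels
    queue = deque((y, x) for y in range(h) for x in range(w) if probs[y][x] > high)
    for y, x in queue:
        keep[y][x] = True
    # Grow each seed through neighbouring weak pixels
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w
                        and not keep[ny][nx] and probs[ny][nx] > low):
                    keep[ny][nx] = True
                    queue.append((ny, nx))
    return keep


# Toy map (assumed): the left component has a strong pixel, the right one does not.
probs = [
    [0.9, 0.6, 0.1, 0.6],
    [0.1, 0.1, 0.1, 0.6],
    [0.1, 0.1, 0.1, 0.1],
]
kept = hysteresis(probs, low=0.5, high=0.8)
```

The weak pixel next to the 0.9 seed survives, while the weak-only component on the right is discarded even though it passes the low threshold.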

(4) Shape vectorization

This step mainly vectorizes lines: for each connected component, it finds the longest central line.

Code:

from skimage.graph import MCP_Connect
from skimage.morphology import skeletonize
from skimage.measure import label as skimage_label
from sklearn.metrics.pairwise import euclidean_distances
from scipy.signal import convolve2d
from collections import defaultdict
import numpy as np


def find_lines(lines_mask: np.ndarray) -> list:
    """
    Finds the longest central line for each connected component in the given binary mask.

    :param lines_mask: Binary mask of the detected line-areas
    :return: a list of Opencv-style polygonal lines (each contour encoded as [N,1,2] elements where each tuple is (x,y) )
    """
    # Make sure one-pixel wide 8-connected mask
    lines_mask = skeletonize(lines_mask)

    class MakeLineMCP(MCP_Connect):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.connections = dict()
            self.scores = defaultdict(lambda: np.inf)

        def create_connection(self, id1, id2, pos1, pos2, cost1, cost2):
            k = (min(id1, id2), max(id1, id2))
            s = cost1 + cost2
            if self.scores[k] > s:
                self.connections[k] = (pos1, pos2, s)
                self.scores[k] = s

        def get_connections(self, subsample=5):
            results = dict()
            for k, (pos1, pos2, s) in self.connections.items():
                path = np.concatenate([self.traceback(pos1), self.traceback(pos2)[::-1]])
                results[k] = path[::subsample]
            return results

        def goal_reached(self, int_index, float_cumcost):
            if float_cumcost > 0:
                return 2
            else:
                return 0

    if np.sum(lines_mask) == 0:
        return []
    # Find extremities points
    end_points_candidates = np.stack(np.where((convolve2d(lines_mask, np.ones((3, 3)), mode='same') == 2) & lines_mask)).T
    connected_components = skimage_label(lines_mask, connectivity=2)
    # Group endpoint by connected components and keep only the two points furthest away
    d = defaultdict(list)
    for pt in end_points_candidates:
        d[connected_components[pt[0], pt[1]]].append(pt)
    end_points = []
    for pts in d.values():
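        # Use a separate name so the distance matrix does not shadow the dict `d` being iterated
        dists = euclidean_distances(np.stack(pts), np.stack(pts))
        i, j = np.unravel_index(dists.argmax(), dists.shape)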
        end_points.append(pts[i])
        end_points.append(pts[j])
    end_points = np.stack(end_points)

    mcp = MakeLineMCP(~lines_mask)
    mcp.find_costs(end_points)
    connections = mcp.get_connections()
    if not np.all(np.array(sorted([i for k in connections.keys() for i in k])) == np.arange(len(end_points))):
        print('Warning : find_lines seems weird')
    return [c[:, None, ::-1] for c in connections.values()]
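The endpoint test in `find_lines` relies on a simple counting trick: on a one-pixel-wide skeleton, a 3x3 window centred on an endpoint contains exactly two set pixels (the endpoint itself and its single neighbour). The sketch below reproduces just that step in plain Python. The name `skeleton_endpoints` and the toy skeleton are assumptions for illustration, and out-of-image pixels count as zero.

```python
def skeleton_endpoints(skel):
    """Return (row, col) positions of skeleton pixels that have exactly one neighbour."""
    h, w = len(skel), len(skel[0])
    endpoints = []
    for y in range(h):
        for x in range(w):
            if not skel[y][x]:
                continue
            # Sum over the 3x3 window, including the pixel itself
            total = sum(skel[j][i]
                        for j in range(max(0, y - 1), min(h, y + 2))
                        for i in range(max(0, x - 1), min(w, x + 2)))
            if total == 2:
                endpoints.append((y, x))
    return endpoints


# Toy skeleton (assumed): one diagonal step followed by a horizontal run.
skel = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
ends = skeleton_endpoints(skel)
```

The positions come out as (row, col). `find_lines` reverses the last axis with `[:, None, ::-1]` so that its returned points follow OpenCV's (x, y) convention.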

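(5) Extract the corner points of the mask's outline to obtain a set of coordinates in vector form, similar to the findContours operation in OpenCV. The implementation also uses a KD-tree. It can return an axis-aligned rectangle (rectangle), a rotated rectangle (min_rectangle), or an arbitrary convex quadrilateral (quadrilateral).

Code: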

import cv2
import numpy as np
import math
from shapely import geometry
from scipy.spatial import KDTree


def find_boxes(boxes_mask: np.ndarray, mode: str= 'min_rectangle', min_area: float=0.2,
               p_arc_length: float=0.01, n_max_boxes=math.inf) -> list:
    """
    Finds the coordinates of the box in the binary image `boxes_mask`.

    :param boxes_mask: Binary image: the mask of the box to find. uint8, 2D array
    :param mode: 'min_rectangle' : minimum enclosing rectangle, can be rotated
                 'rectangle' : minimum enclosing rectangle, not rotated
                 'quadrilateral' : minimum polygon approximated by a quadrilateral
    :param min_area: minimum area of the box to be found. A value in percentage of the total area of the image.
    :param p_arc_length: used to compute the epsilon value to approximate the polygon with a quadrilateral.
                         Only used when 'quadrilateral' mode is chosen.
    :param n_max_boxes: maximum number of boxes that can be found (default inf).
                        This will select n_max_boxes with largest area.
    :return: list of length n_max_boxes containing boxes with 4 corners [[x1,y1], ..., [x4,y4]]
    """

    assert len(boxes_mask.shape) == 2, \
        'Input mask must be a 2D array ! Mask is now of shape {}'.format(boxes_mask.shape)

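    # findContours returns 3 values in OpenCV 3 and 2 values in OpenCV 4; [-2] is the contour list in both
    contours = cv2.findContours(boxes_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]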
    if contours is None:
        print('No contour found')
        return None
    found_boxes = list()

    h_img, w_img = boxes_mask.shape[:2]

    def validate_box(box: np.array) -> (np.array, float):
        """

        :param box: array of 4 coordinates with format [[x1,y1], ..., [x4,y4]]
        :return: (box, area)
        """
        polygon = geometry.Polygon([point for point in box])
        if polygon.area > min_area * boxes_mask.size:

            # Correct out of range corners
            box = np.maximum(box, 0)
            box = np.stack((np.minimum(box[:, 0], boxes_mask.shape[1]),
                            np.minimum(box[:, 1], boxes_mask.shape[0])), axis=1)

            # return box
            return box, polygon.area

    if mode not in ['quadrilateral', 'min_rectangle', 'rectangle']:
        raise NotImplementedError
    if mode == 'quadrilateral':
        for c in contours:
            epsilon = p_arc_length * cv2.arcLength(c, True)
            cnt = cv2.approxPolyDP(c, epsilon, True)
            # box = np.vstack(simplify_douglas_peucker(cnt[:, 0, :], 4))

            # Find extreme points in Convex Hull
            hull_points = cv2.convexHull(cnt, returnPoints=True)
            # points = cnt
            points = hull_points
            if len(points) > 4:
                # Find the hull points closest to each image corner using nearest neighbors
                tree = KDTree(points[:, 0, :])
                _, ul = tree.query((0, 0))
                _, ur = tree.query((w_img, 0))
                _, dl = tree.query((0, h_img))
                _, dr = tree.query((w_img, h_img))
                box = np.vstack([points[ul, 0, :], points[ur, 0, :],
                                 points[dr, 0, :], points[dl, 0, :]])
            elif len(hull_points) == 4:
                box = hull_points[:, 0, :]
            else:
                continue
            # Todo : test if it looks like a rectangle (2 sides must be more or less parallel)
            # todo : (otherwise we may end with strange quadrilaterals)
            if len(box) != 4:
                mode = 'min_rectangle'
                print('Quadrilateral has {} points. Switching to minimal rectangle mode'.format(len(box)))
            else:
                # found_box = validate_box(box)
                found_boxes.append(validate_box(box))
    if mode == 'min_rectangle':
        for c in contours:
            rect = cv2.minAreaRect(c)
            # np.int0 was an alias of np.intp and is no longer available in NumPy 2.0
            box = cv2.boxPoints(rect).astype(np.intp)
            found_boxes.append(validate_box(box))
    elif mode == 'rectangle':
        for c in contours:
            x, y, w, h = cv2.boundingRect(c)
            box = np.array([[x, y], [x + w, y], [x + w, y + h], [x, y + h]], dtype=int)
            found_boxes.append(validate_box(box))
    # sort by area
    found_boxes = [fb for fb in found_boxes if fb is not None]
    found_boxes = sorted(found_boxes, key=lambda x: x[1], reverse=True)
    if n_max_boxes == 1:
        if found_boxes:
            return found_boxes[0][0]
        else:
            return None
    else:
        return [fb[0] for i, fb in enumerate(found_boxes) if i < n_max_boxes]
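As a rough illustration of the 'rectangle' mode together with the `min_area` filter, the following pure-Python sketch computes an axis-aligned box for a mask and rejects it when it covers too small a fraction of the image. It is a simplification under stated assumptions: the function name `axis_aligned_box` is hypothetical, and the whole mask is treated as a single component rather than being split into contours.

```python
def axis_aligned_box(mask, min_area=0.2):
    """Return corners [[x1, y1], ..., [x4, y4]] of the bounding rectangle, or None if too small."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    x, y = min(xs), min(ys)
    # Width and height count pixels, hence the + 1
    w, h = max(xs) - x + 1, max(ys) - y + 1
    img_area = len(mask) * len(mask[0])
    # min_area is a fraction of the total image area
    if w * h <= min_area * img_area:
        return None
    return [[x, y], [x + w, y], [x + w, y + h], [x, y + h]]


# Toy mask (assumed): a 2x4 block inside a 4x6 image.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
box = axis_aligned_box(mask)
```

The corner order (top-left, top-right, bottom-right, bottom-left) mirrors the one used in the 'rectangle' branch of `find_boxes`.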

Network architecture: