Notes on Deploying PaddleOCR on Huawei Ascend NPU (Including CPU Deployment of Table Detection and Reconstruction)

tylunas

Last modified: 2023-11-28 10:28:58


Category: Deep Learning. Tags: ocr, paddle, Huawei, paddlepaddle, edge computing

First published: 2023-11-23 16:27:58

Article link: https://blog.csdn.net/tylunas/article/details/134580215


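At work we need to run OCR and layout reconstruction on scanned documents. This part used to be built on PaddleOCR 2.6 through the Paddle Inference API, but after a long round of compiling and testing, we found that the service performed poorly on both CPU and Ascend NPU.

In the latest PaddleOCR, FastDeploy is now the recommended route for deployment. FastDeploy is the deployment solution of the Baidu ecosystem. It integrates a range of backends that can run Baidu's models, including Paddle Inference, Paddle Lite and ONNX Runtime, and wraps the inference interfaces of a large number of PaddlePaddle pretrained models together with their preprocessing and postprocessing modules. That gives FastDeploy end-to-end inference. On Ascend NPU, inference goes through Paddle Lite.

After running the sample code we found the inference speed quite acceptable, so we set out to migrate.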

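Building FastDeploy

The FastDeploy 1.0.7 C++ API added support for the table structure recognition model. Unfortunately, the author forgot to expose the returned fields in the Python API, so I added a pybind patch here.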

diff --git a/fastdeploy/vision/ocr/ppocr/ocrmodel_pybind.cc b/fastdeploy/vision/ocr/ppocr/ocrmodel_pybind.cc
index 243d93e2..11fcb276 100644
--- a/fastdeploy/vision/ocr/ppocr/ocrmodel_pybind.cc
+++ b/fastdeploy/vision/ocr/ppocr/ocrmodel_pybind.cc
@@ -29,6 +29,9 @@ void BindPPOCRModel(pybind11::module& m) {
       .def_property("static_shape_infer",
                     &vision::ocr::DBDetectorPreprocessor::GetStaticShapeInfer,
                     &vision::ocr::DBDetectorPreprocessor::SetStaticShapeInfer)
+      .def_property("det_image_shape",
+                    &vision::ocr::DBDetectorPreprocessor::GetDetImageShape,
+                    &vision::ocr::DBDetectorPreprocessor::SetDetImageShape)
       .def_property("max_side_len",
                     &vision::ocr::DBDetectorPreprocessor::GetMaxSideLen,
                     &vision::ocr::DBDetectorPreprocessor::SetMaxSideLen)
diff --git a/fastdeploy/vision/vision_pybind.cc b/fastdeploy/vision/vision_pybind.cc
index 664e6cc1..c1f26fbd 100644
--- a/fastdeploy/vision/vision_pybind.cc
+++ b/fastdeploy/vision/vision_pybind.cc
@@ -160,6 +160,9 @@ void BindVision(pybind11::module& m) {
       .def_readwrite("rec_scores", &vision::OCRResult::rec_scores)
       .def_readwrite("cls_scores", &vision::OCRResult::cls_scores)
       .def_readwrite("cls_labels", &vision::OCRResult::cls_labels)
+      .def_readwrite("table_boxes", &vision::OCRResult::table_boxes)
+      .def_readwrite("table_structure", &vision::OCRResult::table_structure)
+      .def_readwrite("table_html", &vision::OCRResult::table_html)
       .def("__repr__", &vision::OCRResult::Str)
       .def("__str__", &vision::OCRResult::Str);

Now the build can be run.

Building the Python package

export WITH_ASCEND=ON
export ENABLE_VISION=ON
export ENABLE_TEXT=ON
# Path to the package built earlier
python3 setup.py build
python3 setup.py bdist_wheel

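After that, the whl package can be found under python/dist/.

Writing the OCR code

One annoying issue is that Paddle-Lite can only run inference with static shapes on Huawei Ascend. That is tolerable for applications like license plate recognition or reading ID information, but for document recognition it becomes a big problem. Because of it, the PPOCRv3 pipeline packaged by FastDeploy is effectively unusable.

Recognizer

First comes the text recognition model, Recognizer. It takes a cropped and rotated text-block image as input and outputs the text in it. FastDeploy's preprocessor converts the OpenCV HWC layout to CHW and resizes the image to rec_image_shape. In scanned documents the number of characters per line is not fixed. You could set the shape to the width of a full line, but that is even worse for short text blocks: stretch a block to twice its width and the characters get so fat that their components visually come apart (很胖 ends up looking like 彳艮月半).
So one workable approach is to create several Recognizers at a range of aspect ratios and, at inference time, automatically pick one based on the aspect ratio of the incoming crop.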

from fastdeploy.vision.ocr import Recognizer

def build_recognizer_pool(args):
    # Width/height ratios; one static-shape Recognizer is created per ratio
    ratio_list = [2, 20/3, 10, 40/3, 16, 20, 24, 36, 40, 50]
    predictors = []
    # args.rec_image_shape is a "C,H,W" string
    rec_image_shape = [int(v) for v in args.rec_image_shape.split(",")]
    # utility.create_option and logger are helpers not shown in this post
    option, rec_model_file, rec_params_file = utility.create_option(args, 'rec', logger)
    for ratio in ratio_list:
        predictor = Recognizer(rec_model_file, rec_params_file, args.rec_char_dict_path,
                               runtime_option=option)
        processor = predictor.preprocessor
        processor.static_shape_infer = True
        # Keep C and H, set W = H * ratio
        processor.rec_image_shape = [rec_image_shape[0], rec_image_shape[1],
                                     int(rec_image_shape[1] * ratio)]
        predictors.append(predictor)
    return predictors

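In real use you also need a warmup pass. The TBE part of the Ascend NPU Python SDK uses multiprocessing to spawn processes, so using multiple processes or threads in your own program causes problems, and for now the models can only be initialized one by one in a for loop. If anyone experienced with this can point out how to use multiple processes to speed up startup, I would appreciate it.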
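
The warmup pass could look roughly like the sketch below. It is only an illustration: `warm_up`, `shapes` and `make_image` are hypothetical names, the only thing assumed of each predictor is a `predict(image)` method, and an all-black dummy image is assumed to be an acceptable warmup input. The loop stays strictly sequential, in line with the multiprocessing limitation described here.

```python
def warm_up(predictors, shapes, make_image=None, runs=1):
    """Run every predictor on a dummy image, one after another.

    predictors: objects exposing predict(image)
    shapes:     one (height, width) pair per predictor
    make_image: callable (height, width) -> image; defaults to a black
                3-channel numpy array (an assumption about the input format)
    """
    if len(predictors) != len(shapes):
        raise ValueError("need exactly one shape per predictor")
    if make_image is None:
        import numpy as np  # imported lazily, only needed for the default image

        def make_image(height, width):
            return np.zeros((height, width, 3), dtype=np.uint8)

    # Sequential on purpose: no threads or processes are started here
    for predictor, (height, width) in zip(predictors, shapes):
        image = make_image(height, width)
        for _ in range(runs):
            predictor.predict(image)
```

For the recognizer pool, `shapes` would be the height and width taken from each predictor's `rec_image_shape`.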

At inference time, pick the Recognizer that best fits the image size and run prediction.

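    # Aspect ratio (width / height) of the current crop
    width_ratio = img.shape[1] / float(img.shape[0])
    recognizer_idx = 0
    last_ratio = self.ratio_list[0]
    # Walk the ascending ratios and stop at the first one >= width_ratio
    for idx, ratio in enumerate(self.ratio_list[1:]):
        if ratio >= width_ratio:
            # Move up to this ratio only if it is at least as close as the previous one
            if ratio - width_ratio <= width_ratio - last_ratio:
                recognizer_idx += 1
            break
        last_ratio = ratio
        recognizer_idx = idx + 1

    ocr_result = recognizers[recognizer_idx].predict(img_list[rno])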
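
The selection logic can also be written as a small standalone function, which makes it easier to test in isolation. This is a sketch, not the code used in the service: `pick_nearest_ratio` is a hypothetical name, and it assumes `ratio_list` is sorted in ascending order. Ties go to the larger ratio, as in the loop.

```python
import bisect

def pick_nearest_ratio(width_ratio, ratio_list):
    """Return the index of the entry in the ascending ratio_list closest to width_ratio."""
    pos = bisect.bisect_left(ratio_list, width_ratio)
    if pos == 0:
        return 0                      # narrower than the smallest ratio
    if pos == len(ratio_list):
        return len(ratio_list) - 1    # wider than the largest ratio
    lower, upper = ratio_list[pos - 1], ratio_list[pos]
    # Prefer the upper neighbour when it is at least as close as the lower one
    return pos if upper - width_ratio <= width_ratio - lower else pos - 1

ratio_list = [2, 20/3, 10, 40/3, 16, 20, 24, 36, 40, 50]
idx = pick_nearest_ratio(9.0, ratio_list)  # a crop 9 times wider than tall
```

Here `idx` is 2, so the crop would go to the Recognizer built with ratio 10.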

DBDetector

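Generally, if the input is a single-page document, the page aspect ratio is close to 1.414:1, while the DBDetector model's default shape is {3, 960, 960}, i.e. 960*960. A 1:1 image is not much of a problem for detecting text blocks in a document. But there is another requirement. If the page first goes through a layout analysis model (such as picodet_lcnet_x1_0_fgd_layout_cdla) and OCR is then run on the detected regions, you get many single-line and multi-line blocks. The aspect ratio of text blocks such as table titles and figure captions is so far from 1:1 that text recognition quality drops noticeably once they are fed to the model. On the other hand, since layout analysis also outputs multi-line blocks, you have no choice but to send every block through this text detection model.
So DBDetector has to be adapted as well, again by creating several models up front.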

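Note that in the FastDeploy 1.0.7 Python API, DBDetectorPreprocessor has no det_image_shape property, so at first you cannot change this input shape. The C++ API does have it, which is why the patch at the beginning was needed.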

from fastdeploy.vision.ocr import DBDetector

def build_pre_process(args, pre_process_list, predictor):
    processor = predictor.preprocessor
    if args.use_npu:
        # Ascend NPU only supports static input shapes
        processor.static_shape_infer = True
    for preprocess_item in pre_process_list:
        for key, value in preprocess_item.items():
            if key == "NormalizeImage":
                processor.set_normalize(value['mean'], value['std'], True)

def build_detecter_pool(args):
    pre_process_list = [{
        'NormalizeImage': {
            'std': [0.229, 0.224, 0.225],
            'mean': [0.485, 0.456, 0.406],
            'scale': '1./255.',
            'order': 'hwc'
        }
    }] # from Paddle OCR
    # Width/height ratio of each entry in resolution_list
    ratio_list = [1/8, 1/4, 1/2, 1, 2, 4, 8, 16]
    # resolution H x W
    resolution_list = [[2048, 256], [1280, 320], [1920, 960], [1280, 1280],
                       [960, 1920], [320, 1280], [256, 2048], [160, 2560]]
    predictors = []
    option, det_model_file, det_params_file = utility.create_option(args, 'det', logger)
    for i, ratio in enumerate(ratio_list):
        predictor = DBDetector(det_model_file, det_params_file, runtime_option=option)
        build_pre_process(args, pre_process_list, predictor)
        resolution = resolution_list[i]
        # det_image_shape is [C, H, W]
        predictor.preprocessor.det_image_shape = [3, resolution[0], resolution[1]]
        predictors.append(predictor)
    return predictors

For how to call DBDetector, see the official sample code.
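
How a detector is chosen from the pool is not shown in this post. One possible way, given here purely as an assumption and not as the method actually used, is to compare aspect ratios on a log scale, since ratio_list is spaced in powers of two. `pick_detector` is a hypothetical name.

```python
import math

RATIO_LIST = [1/8, 1/4, 1/2, 1, 2, 4, 8, 16]
# Resolution H x W for each ratio, as in build_detecter_pool
RESOLUTION_LIST = [[2048, 256], [1280, 320], [1920, 960], [1280, 1280],
                   [960, 1920], [320, 1280], [256, 2048], [160, 2560]]

def pick_detector(height, width, ratio_list=RATIO_LIST):
    """Return the index of the ratio closest to width / height in log2 space."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    target = math.log2(width / height)
    return min(range(len(ratio_list)),
               key=lambda i: abs(math.log2(ratio_list[i]) - target))

# A 300 x 1000 (H x W) caption block
idx = pick_detector(300, 1000)
det_height, det_width = RESOLUTION_LIST[idx]
```

With these inputs the block is matched to the ratio 4 entry, whose static shape is 320 x 1280.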

With 16 rec models and 8 det models loaded, device memory usage is 1.16G, which is fine even on an Ascend 310.
HBM usage as shown in npu-smi

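Adapting the table recognition and reconstruction module

PaddleOCR 2.6 implemented a table structure reconstruction model, and later a PR from the 4th Hackathon brought it into FastDeploy. However, besides not exposing the return values in the Python API, this part has no official sample code yet, so you have to feel your way through based on the C++ code.

Table structure model

For now this model can only run on CPU, but that is not a big deal. Just ignore the NPU-related flags.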

import fastdeploy
from fastdeploy.vision.ocr import StructureV2Table

def build_structurer(args):
    # create_option returns (option, model_file, params_file) as in the rec/det pools;
    # the returned option is discarded because this model runs on CPU
    _, table_model_file, table_params_file = utility.create_option(args, 'table', logger)
    # Manually create an option for CPU inference
    option = fastdeploy.RuntimeOption()
    option.set_cpu_thread_num(getattr(args, "cpu_threads", 8))
    return StructureV2Table(table_model_file, table_params_file, args.table_char_dict_path, option)

The result returned at inference time is largely consistent with the original one.

import numpy as np

def predict(predictor, img):
    ocr_result = predictor.predict(img)
    # These two fields are the ones exposed by the pybind patch
    bbox_list = ocr_result.table_boxes
    structure_str_list = ocr_result.table_structure
    bbox_list = np.array(bbox_list, np.float32)
    # Wrap the structure tokens in the outer HTML tags
    structure_str_list = ['<html>', '<body>', '<table>'] + structure_str_list + ['</table>', '</body>', '</html>']
    return (structure_str_list, bbox_list)
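
To make the wrapping step concrete, here is a small sketch of how a token list turns into an HTML string. The sample tokens are an assumption about what the structure model emits (tags split into pieces such as '<td' and ' colspan="2"'), and `structure_to_html` is a hypothetical helper name. The cells stay empty at this stage, since the text is filled in by the later matching step.

```python
def wrap_table_structure(structure_tokens):
    """Add the outer html/body/table tags around the predicted structure tokens."""
    return (['<html>', '<body>', '<table>']
            + list(structure_tokens)
            + ['</table>', '</body>', '</html>'])

def structure_to_html(structure_tokens):
    """Join the wrapped tokens into one HTML string."""
    return "".join(wrap_table_structure(structure_tokens))

# Assumed sample output of the structure model: one row, second cell spans two columns
tokens = ['<tr>', '<td>', '</td>', '<td', ' colspan="2"', '>', '</td>', '</tr>']
html = structure_to_html(tokens)
```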

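With that, ppstructure/table/predict_structure.py can be modified accordingly.

Table text recognition flow

The flow in ppstructure/table/predict_table.py should not have needed any changes, but in actual testing the OCR results from the Python code dropped far too much. Tracking it down showed that the problem was in how the image was cropped. The _ocr() method of TableSystem needs to be modified.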

    def _ocr(self, img):
        ...
        img_crop_list = []
        for i in range(len(dt_boxes)):
            det_box = dt_boxes[i]
            x0, y0, x1, y1 = expand(2, det_box, img.shape)
            text_rect = img[int(y0):int(y1), int(x0):int(x1), :]
            img_crop_list.append(text_rect)
        ...

Change it to:

from tools.infer.utility import get_rotate_crop_image

    def _ocr(self, img):
        ...
        img_crop_list = []
        crop_img = copy.deepcopy(img)
        for i in range(len(dt_boxes)):
            det_box = dt_boxes[i]
            x0, y0, x1, y1 = expand(2, det_box, img.shape)
            text_rect = get_rotate_crop_image(img, np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]], np.float32))
            img_crop_list.append(text_rect)
        ...

With the new cropping method, the results improved immediately.
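
The geometric part of the change is just turning the padded axis-aligned box into a four-point quadrilateral, ordered top-left, top-right, bottom-right, bottom-left, before handing it to get_rotate_crop_image. A minimal sketch of that conversion follows. `expand_box` is a hypothetical stand-in for the `expand` helper used in the snippets, assuming it pads the box by `pix` pixels and clamps it to the image bounds, and `box_to_quad` is likewise a made-up name.

```python
def expand_box(pix, det_box, shape):
    """Pad (x0, y0, x1, y1) by pix pixels on every side, clamped to the image."""
    x0, y0, x1, y1 = det_box
    height, width = shape[0], shape[1]
    return (max(x0 - pix, 0), max(y0 - pix, 0),
            min(x1 + pix, width), min(y1 + pix, height))

def box_to_quad(x0, y0, x1, y1):
    """Corner points in the order top-left, top-right, bottom-right, bottom-left."""
    return [[x0, y0], [x1, y0], [x1, y1], [x0, y1]]

# A box near the left and right edges of a 50 x 100 (H x W) image
box = expand_box(2, (1, 5, 98, 40), (50, 100, 3))
quad = box_to_quad(*box)
```

In the modified `_ocr()`, the equivalent of `quad` is wrapped in `np.array(..., np.float32)` and passed as the points argument.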

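Building Paddle-Lite (work in progress)

The Ascend NPU package that FastDeploy ships by default is still based on CANN 5.1.RC2.alpha001, and every run prints “CANN version mismatch. The build version is 0.0.0, but the current environment version is x.x.x.”.

FastDeploy 1.0.7 requires Paddle-Lite 2.12 or later.

Since this is not a cross-compile, gcc and g++ must be specified explicitly to override the default armlinux configuration.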

C++

export CC=/opt/rh/devtoolset-7/root/usr/bin/gcc
export CXX=/opt/rh/devtoolset-7/root/usr/bin/g++
bash ./lite/tools/build_linux.sh --arch=armv8 --toolchain=gcc --with_extra=ON --with_log=ON --with_exception=ON --with_nnadapter=ON \
     --nnadapter_with_huawei_ascend_npu=ON --nnadapter_huawei_ascend_npu_sdk_root=/usr/local/Ascend/ascend-toolkit/latest --nnadapter_huawei_ascend_npu_sdk_version=5.1.RC2 full_publish

Python

export CC=/opt/rh/devtoolset-7/root/usr/bin/gcc
export CXX=/opt/rh/devtoolset-7/root/usr/bin/g++
bash ./lite/tools/build_linux.sh --arch=armv8 --toolchain=gcc --with_extra=ON --with_log=ON --with_exception=ON --with_nnadapter=ON --with_python=ON --python_version=3.7 --nnadapter_with_huawei_ascend_npu=ON \
     --nnadapter_huawei_ascend_npu_sdk_root=/usr/local/Ascend/ascend-toolkit/latest --nnadapter_huawei_ascend_npu_sdk_version=5.1.RC2 full_publish

Next, the contents of build.lite.linux.armv8.gcc/inference_lite_lib.armlinux.armv8/ need to be packed into the format FastDeploy expects.

tar -czvf Python.inference_lite_lib.ubuntu.armv8.huawei_ascend_npu.tar.gz -C build.lite.linux.armv8.gcc/inference_lite_lib.armlinux.armv8/cxx/ .

Upload the generated tar to your own HTTP server, then set the environment variable:

export PADDLELITE_URL="http://xxx/yyy/Python.inference_lite_lib.ubuntu.armv8.huawei_ascend_npu.tar.gz"