Converting PDF documents to HTML

ty-boy

Category: docker. Tags: docker linux python pip centos

Article link: https://blog.csdn.net/weixin_45313105/article/details/105052676

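Preface

In day-to-day work we occasionally need to convert images into text. Most office software still cannot do this, so when the need arises we have to solve it ourselves. In addition, a business requirement called for batch-converting collected PDFs to HTML. For titles, the relevant part of the PDF can be captured as a screenshot, the text in the image recognized, and the result used for title matching. Implementing this needs the following tools: Chrome and chromedriver (their versions must match), tesseract-ocr with the Chinese and English language packs, and a PDF-to-HTML converter. Python 3 is also required. The deployment here uses Docker plus Rancher.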
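
The title-matching step mentioned above might look roughly like the following. This is only a minimal sketch, assuming the OCR text from a screenshot is compared against a list of candidate titles with a fuzzy ratio from `difflib`. The names `normalize`, `best_title_match`, `ocr_image` and the 0.6 threshold are illustrative assumptions and do not come from the original post. The OCR helper relies on `pytesseract.image_to_string` with the `chi_sim+eng` language data and needs tesseract and Pillow to be installed.

```python
import difflib
import re


def normalize(text):
    """Lowercase and drop whitespace/punctuation so OCR noise matters less."""
    return re.sub(r"[\W_]+", "", text.lower())


def best_title_match(ocr_text, candidates, threshold=0.6):
    """Return the candidate title most similar to the OCR text, or None."""
    target = normalize(ocr_text)
    best, best_score = None, 0.0
    for title in candidates:
        score = difflib.SequenceMatcher(None, normalize(title), target).ratio()
        if score > best_score:
            best, best_score = title, score
    return best if best_score >= threshold else None


def ocr_image(path, lang="chi_sim+eng"):
    """Run tesseract on a screenshot (lazy imports; needs pytesseract + Pillow)."""
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path), lang=lang)
```

A caller would typically pass the result of `ocr_image(...)` to `best_title_match(...)` together with the titles collected alongside the PDFs. The threshold would need tuning against real data.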

Software	Version / download URL
Chrome browser	https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
chromedriver	http://npm.taobao.org/mirrors/chromedriver/80.0.3987.106/chromedriver_linux64.zip
tesseract-ocr	https://github.com/tesseract-ocr/tesseract/archive/4.1.1.zip
Leptonica	http://www.leptonica.org/source/leptonica-1.79.0.tar.gz
Leptonica (GitHub)	https://github.com/DanBloomberg/leptonica/tree/v1.74.3#the-library-supports-many-operations-that-are-useful-on
tesseract language data	https://github.com/tesseract-ocr/tessdata
tesseract language data (Simplified Chinese)	https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata
tesseract language data (English)	https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
pdf2htmlex	bwits/pdf2htmlex-alpine:latest
Python version	python3
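
Because the Chrome RPM URL always points to the current stable release while the chromedriver URL pins 80.0.3987.106, a quick compatibility check may be useful. The sketch below assumes that "versions must match" means the major version numbers agree, and that the input strings come from `google-chrome --version` and `chromedriver --version`. The function names are hypothetical.

```python
import re


def major_version(version_output):
    """Extract the major version number from a `--version` output string."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError("no version number found in: %r" % version_output)
    return int(match.group(1))


def versions_compatible(chrome_output, driver_output):
    """Treat Chrome and chromedriver as compatible when major versions agree."""
    return major_version(chrome_output) == major_version(driver_output)
```

If the check fails, either pin the Chrome RPM to an older build or download the chromedriver that corresponds to the installed Chrome.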

Main text

Writing the Dockerfile:

FROM  centos:7.6.1810
WORKDIR /usr/local/share/
LABEL AUTHOR="IVAN DU" VERSION="2003.03.09" BUILD_DATE="2020-03-9" \
      RESOURCES="https://github.com/tesseract-ocr/tesseract http://www.leptonica.org/index.html https://github.com/tesseract-ocr/tessdata"
ENV   LD_LIBRARY_PATH="/usr/local/lib" \
      LIBLEPT_HEADERSDIR="/usr/local/include" \
      PKG_CONFIG_PATH="/usr/local/lib/pkgconfig" \
      TESSDATA_PREFIX="/usr/local/share/tessdata" 
ADD   tesseract-4.1.1.tar.gz leptonica-1.79.0.tar.gz Python-3.7.4.tgz /
ADD   chromedriver google-chrome-stable_current_x86_64.rpm htmltoPDF.yml /usr/local/share/
COPY  chi_sim.traineddata eng.traineddata /usr/local/share/tessdata/
#COPY  kubernetes.repo /etc/yum.repos.d/
RUN   mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo_bak \
      && curl -o /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo \
      && curl -o /etc/yum.repos.d/epel-7.repo http://mirrors.aliyun.com/repo/epel-7.repo \
      && yum clean all && yum makecache -y \
      && yum install -y file automake libjpeg-devel libpng-devel libtiff-devel zlib-devel libtool gcc-c++ make bzip2-devel openssl-devel openssl-static ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel lzma gcc lsb wget \
#      && yum install -y --nogpgcheck kubectl \
#      && mkdir -p /root/.kube \
#      && wget -P /root/.kube http://ftp_address/shared/docker/kubectl \
      && echo export CHROMEDRIVER_PATH="/usr/local/share" >> /root/.bashrc \
      && echo export PATH=\$PATH:\$CHROMEDRIVER_PATH >> /root/.bashrc \
      && cd /leptonica-1.79.0 && ./configure && make && make install \
      && cd /tesseract-4.1.1 && ./autogen.sh && ./configure && make && make install \
      && cd /Python-3.7.4 && ./configure --prefix=/usr/local/python3  && make && make install \
      && ln -s /usr/local/python3/bin/python3  /usr/bin/python3 \
      && ln -s /usr/local/python3/bin/pip3 /usr/bin/pip \
      && pip install -i https://mirrors.aliyun.com/pypi/simple --upgrade pip \
      && pip install -i https://mirrors.aliyun.com/pypi/simple bs4==0.0.1 \
      && pip install -i https://mirrors.aliyun.com/pypi/simple pymongo==3.9.0 \
      && pip install -i https://mirrors.aliyun.com/pypi/simple hdfs==2.5.8 \
      && pip install -i https://mirrors.aliyun.com/pypi/simple selenium==3.141.0 \
      && pip install -i https://mirrors.aliyun.com/pypi/simple pytesseract==0.3.1 \
      && pip install -i https://mirrors.aliyun.com/pypi/simple lxml==4.4.2 \
      && pip install -i https://mirrors.aliyun.com/pypi/simple PyPDF2==1.26.0 \
      && pip install -i https://mirrors.aliyun.com/pypi/simple requests_toolbelt==0.9.1 \
      && pip install -i https://mirrors.aliyun.com/pypi/simple yagmail==0.11.224 \
      && pip install -i https://mirrors.aliyun.com/pypi/simple elasticsearch==7.5.1 \
#      PyMuPDF provides the fitz module; the separate PyPI package named "fitz" is unrelated, so it is not installed
      && pip install -i https://mirrors.aliyun.com/pypi/simple PyMuPDF==1.16.10  \
      && pip install -i https://mirrors.aliyun.com/pypi/simple pikepdf  \
      && pip install -i https://mirrors.aliyun.com/pypi/simple opencv-python \
      && cd /usr/local/share/ \
      && chmod +x chromedriver \
      && mkdir py \
#      download the Python code from the FTP server
      && wget -P /usr/local/share/py -c -r -np -nd -k -L -p -A .py -R "index.html*" http://ftp_address/pdf_parse/ \
      && yum localinstall -y google-chrome-stable_current_x86_64.rpm \
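#      NOTE: this appends the flags as a new line at the end of the launcher script, after its final exec, so they may never reach Chrome; passing them through Selenium ChromeOptions is the more reliable route
      && echo "--no-sandbox --disable-dev-shm-usage" >>  /opt/google/chrome/google-chrome \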
      && rm -rf chromedriver_linux64.zip google-chrome-stable_current_x86_64.rpm /leptonica-1.79.0 /tesseract-4.1.1 /Python-3.7.4 \
      && mkdir /pdf
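
The Chrome flags used in the Dockerfile (`--no-sandbox`, `--disable-dev-shm-usage`) can also be supplied from Python when the browser is started. The following is a sketch under assumptions: it targets the Selenium 3.x API pinned above (`selenium==3.141.0`), it adds `--headless` on the assumption that the container has no display, and the helper names `build_chrome_arguments` and `screenshot` are hypothetical rather than taken from the author's code.

```python
CHROME_FLAGS = ["--headless", "--no-sandbox", "--disable-dev-shm-usage"]


def build_chrome_arguments(window_size=(1280, 1024), extra=()):
    """Return the command-line flags to hand to Chrome inside a container."""
    width, height = window_size
    args = list(CHROME_FLAGS)
    args.append("--window-size=%d,%d" % (width, height))
    args.extend(extra)
    return args


def screenshot(url, out_path, driver_path="/usr/local/share/chromedriver"):
    """Open `url` in headless Chrome and save a PNG screenshot to `out_path`."""
    from selenium import webdriver  # imported lazily; Selenium 3.x API

    options = webdriver.ChromeOptions()
    for arg in build_chrome_arguments():
        options.add_argument(arg)
    driver = webdriver.Chrome(executable_path=driver_path, options=options)
    try:
        driver.get(url)
        driver.save_screenshot(out_path)
    finally:
        # Always release the browser process, even if the page load fails.
        driver.quit()
```

The default `driver_path` matches where the Dockerfile places `chromedriver`; the saved PNG could then be fed to tesseract for title recognition.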

Building the Docker image:

docker build -f dockerfile-tesseract -t harbor_address:ports/dir/name:tag_numb .
# Example:
docker build -f dockerfile-tesseract -t 192.168.10.10:88/target/tesseract-py:1 .
# List the built images:
docker images
# Run the image as a test:
docker run -itd --name test-tesseract harbor_address:ports/dir/name:tag_numb
# Open a shell inside the container:
docker ps # look up the container ID
docker exec -it CONTAINER_ID /bin/bash
# Push the image to Harbor:
docker push harbor_address:ports/dir/name:tag_numb

Deploying with Rancher

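For how to run the host's docker command from inside a container, see:

Using the host's docker command that has been mapped into the container as described above, start a bwits/pdf2htmlex-alpine container and pass it the command and arguments to run.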

docker run -i --rm -v /data/0466e08c6ab707a9814755e33a9b2ba6:/pdf 172.19.79.6:88/fuyun/pdf2htmlex-alpine pdf2htmlEX --zoom 1.3 --split-pages 1 --page-filename page_%d.page 0466e08c6ab707a9814755e33a9b2ba6.pdf
# Note: /data is the Ceph RBD mount directory; page_%d.page means the HTML is saved one file per page.
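
From Python, the same `docker run` invocation could be assembled and launched with `subprocess`. This is a hedged sketch: it assumes the layout implied by the command above (each PDF lives in `<data_root>/<doc_id>/<doc_id>.pdf`), it defaults to the public image name from the tools table rather than the private registry address, and the function names `pdf2htmlex_command` and `convert` are hypothetical.

```python
import os
import subprocess


def pdf2htmlex_command(host_dir, pdf_name,
                       image="bwits/pdf2htmlex-alpine", zoom=1.3):
    """Build the `docker run` argv that converts one PDF, one file per page."""
    return [
        "docker", "run", "-i", "--rm",
        "-v", "%s:/pdf" % host_dir,
        image,
        "pdf2htmlEX",
        "--zoom", str(zoom),
        "--split-pages", "1",
        "--page-filename", "page_%d.page",
        pdf_name,
    ]


def convert(data_root, doc_id):
    """Convert <data_root>/<doc_id>/<doc_id>.pdf; raises if docker fails."""
    host_dir = os.path.join(data_root, doc_id)
    subprocess.run(pdf2htmlex_command(host_dir, doc_id + ".pdf"), check=True)
```

Passing the arguments as a list avoids shell quoting problems, and `check=True` turns a failed conversion into an exception instead of a silent error.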

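Overall flow

Host docker --> the docker command, libraries, etc. are mapped into the container --> the container calls the docker command to create a container on the host --> that host container is one-shot and ends its lifecycle automatically when the task finishes. /data is mounted by two containers. The container inside Rancher mounts it at /pdf to download PDF documents and store them on the Ceph RBD volume mounted on the host. The container created on the host also mounts it at /pdf; when it starts, it processes the PDFs saved on the host and writes the results back to the host directory.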
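
Once the one-shot container has written its output back to the shared directory, the Rancher-side code still has to pick up the per-page files. A minimal sketch of that last step follows, assuming the files are named `page_<n>.page` as in the command shown earlier; `list_pages` is a hypothetical helper, not part of the author's code.

```python
import os
import re

PAGE_RE = re.compile(r"^page_(\d+)\.page$")


def list_pages(directory):
    """Return the per-page files in `directory`, ordered by page number."""
    pages = []
    for name in os.listdir(directory):
        match = PAGE_RE.match(name)
        if match:
            pages.append((int(match.group(1)), os.path.join(directory, name)))
    # Sort numerically so that page_10 comes after page_2.
    return [path for _, path in sorted(pages)]
```

Sorting on the parsed integer rather than the file name keeps the pages in reading order for documents with ten or more pages.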
