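Preface
Everyday work occasionally requires converting images to text. Most office software still cannot do this, so when the need arises we have to solve it ourselves. In our case there is also a business requirement: collected PDFs must be converted to HTML in bulk, and to identify titles we take a screenshot of the relevant PDF content, recognize the text in the image, and match it against the title. This needs several tools: Chrome and chromedriver, whose versions must match; tesseract-ocr with the Chinese and English language packs; and a PDF-to-HTML converter. Python 3 is also required. We deploy everything with docker + rancher.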
Software | Version / download URL |
---|---|
chrome浏览器 | https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm |
chromedriver | http://npm.taobao.org/mirrors/chromedriver/80.0.3987.106/chromedriver_linux64.zip |
tesseract-ocr | https://github.com/tesseract-ocr/tesseract/archive/4.1.1.zip |
Leptonica | http://www.leptonica.org/source/leptonica-1.79.0.tar.gz |
Leptonica-github | https://github.com/DanBloomberg/leptonica/tree/v1.74.3#the-library-supports-many-operations-that-are-useful-on |
tesseract language packs | https://github.com/tesseract-ocr/tessdata |
tesseract language pack (Simplified Chinese) | https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata |
tesseract language pack (English) | https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata |
pdf2htmlex | bwits/pdf2htmlex-alpine:latest |
Python | python3 (3.7.4 in the Dockerfile) |
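The title-matching step described in the preface (OCR a screenshot, then compare the recognized text with candidate titles) could look roughly like the sketch below. It is only an illustration: the function names, the `chi_sim+eng` language string, and the 0.6 similarity cutoff are assumptions, not taken from the original code. `ocr_image` needs pytesseract, Pillow, and the tesseract binary with both language packs installed; the matcher itself uses only the standard library.

```python
import difflib
import re


def normalize(text):
    """Remove all whitespace and lowercase; OCR output often contains stray spaces."""
    return re.sub(r"\s+", "", text).lower()


def ocr_image(image_path, lang="chi_sim+eng"):
    """Run tesseract on one image file and return the recognized text."""
    # Imported lazily so that the matcher below works without the OCR stack.
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


def best_title(ocr_text, candidates, cutoff=0.6):
    """Return the candidate most similar to the OCR text, or None below the cutoff."""
    target = normalize(ocr_text)
    best, best_score = None, cutoff
    for candidate in candidates:
        score = difflib.SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score >= best_score:
            best, best_score = candidate, score
    return best
```

A call such as `best_title(ocr_image("/pdf/title.png"), headings)` would then pick the heading closest to the screenshot text; the path and the `headings` list here are hypothetical.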
Main text
Writing the Dockerfile:
FROM centos:7.6.1810
WORKDIR /usr/local/share/
LABEL AUTHOR="IVAN DU" VERSION="2003.03.09" BUILD_DATE="2020-03-9" \
RESOURCES="https://github.com/tesseract-ocr/tesseract http://www.leptonica.org/index.html https://github.com/tesseract-ocr/tessdata"
ENV LD_LIBRARY_PATH="/usr/local/lib" \
LIBLEPT_HEADERSDIR="/usr/local/include" \
PKG_CONFIG_PATH="/usr/local/lib/pkgconfig" \
TESSDATA_PREFIX="/usr/local/share/tessdata"
ADD tesseract-4.1.1.tar.gz leptonica-1.79.0.tar.gz Python-3.7.4.tgz /
ADD chromedriver google-chrome-stable_current_x86_64.rpm htmltoPDF.yml /usr/local/share/
COPY chi_sim.traineddata eng.traineddata /usr/local/share/tessdata/
#COPY kubernetes.repo /etc/yum.repos.d/
RUN mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo_bak \
&& curl -o /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo \
&& curl -o /etc/yum.repos.d/epel-7.repo http://mirrors.aliyun.com/repo/epel-7.repo \
&& yum clean all && yum makecache -y \
&& yum install -y file automake libjpeg-devel libpng-devel libtiff-devel zlib-devel libtool gcc-c++ make bzip2-devel openssl-devel openssl-static ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel lzma gcc lsb wget \
# && yum install -y --nogpgcheck kubectl \
# && mkdir -p /root/.kube \
# && wget -P /root/.kube http://ftp_address/shared/docker/kubectl \
&& echo export CHROMEDRIVER_PATH="/usr/local/share" >> /root/.bashrc \
&& echo export PATH=\$PATH:\$CHROMEDRIVER_PATH >> /root/.bashrc \
&& cd /leptonica-1.79.0 && ./configure && make && make install \
&& cd /tesseract-4.1.1 && ./autogen.sh && ./configure && make && make install \
&& cd /Python-3.7.4 && ./configure --prefix=/usr/local/python3 && make && make install \
&& ln -s /usr/local/python3/bin/python3 /usr/bin/python3 \
&& ln -s /usr/local/python3/bin/pip3 /usr/bin/pip \
&& pip install -i https://mirrors.aliyun.com/pypi/simple --upgrade pip \
&& pip install -i https://mirrors.aliyun.com/pypi/simple bs4==0.0.1 \
&& pip install -i https://mirrors.aliyun.com/pypi/simple pymongo==3.9.0 \
&& pip install -i https://mirrors.aliyun.com/pypi/simple hdfs==2.5.8 \
&& pip install -i https://mirrors.aliyun.com/pypi/simple selenium==3.141.0 \
&& pip install -i https://mirrors.aliyun.com/pypi/simple pytesseract==0.3.1 \
&& pip install -i https://mirrors.aliyun.com/pypi/simple lxml==4.4.2 \
&& pip install -i https://mirrors.aliyun.com/pypi/simple PyPDF2==1.26.0 \
&& pip install -i https://mirrors.aliyun.com/pypi/simple requests_toolbelt==0.9.1 \
&& pip install -i https://mirrors.aliyun.com/pypi/simple yagmail==0.11.224 \
&& pip install -i https://mirrors.aliyun.com/pypi/simple elasticsearch==7.5.1 \
&& pip install -i https://mirrors.aliyun.com/pypi/simple fitz \
&& pip install -i https://mirrors.aliyun.com/pypi/simple PyMuPDF==1.16.10 \
&& pip install -i https://mirrors.aliyun.com/pypi/simple pikepdf \
&& pip install -i https://mirrors.aliyun.com/pypi/simple opencv-python \
&& cd /usr/local/share/ \
&& chmod +x chromedriver \
&& mkdir py \
# Download the Python code from the FTP server
&& wget -P /usr/local/share/py -c -r -np -nd -k -L -p -A .py -R "index.html*" http://ftp_address/pdf_parse/ \
&& yum localinstall -y google-chrome-stable_current_x86_64.rpm \
&& echo "--no-sandbox --disable-dev-shm-usage" >> /opt/google/chrome/google-chrome \
&& rm -rf chromedriver_linux64.zip google-chrome-stable_current_x86_64.rpm /leptonica-1.79.0 /tesseract-4.1.1 /Python-3.7.4 \
&& mkdir /pdf
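Because the Chrome package is the "current" stable rpm while chromedriver is pinned to 80.0.3987.106, the two may drift apart over time. A small check like the following could be run inside the built container to confirm that the major versions still agree. This is a sketch under assumptions: the helper names are made up for illustration, and it assumes both binaries print their version when called with `--version`.

```python
import re
import subprocess


def major_version(version_output):
    """Extract the major version from text like 'ChromeDriver 80.0.3987.106 (...)'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError("no version number in: %r" % version_output)
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """True when Chrome and chromedriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


def check_installed(chrome_bin="google-chrome",
                    driver_bin="/usr/local/share/chromedriver"):
    """Query both binaries; the default paths follow the Dockerfile above."""
    chrome = subprocess.check_output([chrome_bin, "--version"]).decode()
    driver = subprocess.check_output([driver_bin, "--version"]).decode()
    return versions_match(chrome, driver)
```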
Building the Docker image:
docker build -f dockerfile-tesseract -t harbor_address:ports/dir/name:tag_numb .
For example: docker build -f dockerfile-tesseract -t 192.168.10.10:88/target/tesseract-py:1 .
# List the built images:
docker images
# Run the image for testing:
docker run -itd --name test-tesseract harbor_address:ports/dir/name:tag_numb
# Attach to the running container:
docker ps # look up the container ID
docker exec -it CONTAINER_ID /bin/bash
# Push the image to harbor
docker push harbor_address:ports/dir/name:tag_numb
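Once inside the test container, a short smoke test can confirm that Chrome, chromedriver, and Selenium work together. The sketch below passes the sandbox-related flags per session through Selenium's Chrome options. It assumes selenium 3.141.0 as installed in the Dockerfile; the function names, the `--headless` and window-size flags, and the example paths are illustrative assumptions.

```python
from pathlib import PurePosixPath

# Flags commonly needed when Chrome runs as root inside a container.
CHROME_FLAGS = [
    "--headless",
    "--no-sandbox",
    "--disable-dev-shm-usage",
    "--window-size=1280,1024",
]


def file_url(path):
    """Turn an absolute POSIX path into a file:// URL that Chrome can open."""
    return PurePosixPath(path).as_uri()


def take_screenshot(url, out_png,
                    driver_path="/usr/local/share/chromedriver",
                    flags=CHROME_FLAGS):
    """Open url in headless Chrome and save a PNG; returns True on success."""
    # Imported lazily so the helpers above can be used without selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for flag in flags:
        options.add_argument(flag)
    driver = webdriver.Chrome(executable_path=driver_path, options=options)
    try:
        driver.get(url)
        return driver.save_screenshot(out_png)
    finally:
        driver.quit()  # always release the browser process
```

For instance, `take_screenshot(file_url("/pdf/sample.html"), "/pdf/sample.png")` would render a hypothetical local HTML file to an image that can then be passed to OCR.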
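Rancher deployment
For running the host's docker command from inside a container, see:
Using the host's docker command mapped into the container as described above, start the bwits/pdf2htmlex-alpine container and specify the command and arguments it should run.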
docker run -i --rm -v /data/0466e08c6ab707a9814755e33a9b2ba6:/pdf 172.19.79.6:88/fuyun/pdf2htmlex-alpine pdf2htmlEX --zoom 1.3 --split-pages 1 --page-filename page_%d.page 0466e08c6ab707a9814755e33a9b2ba6.pdf
# Note: /data is the ceph rbd mount directory, and page_%d.page means the HTML is saved one file per page.
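From the Python side, the command above can be assembled as an argument list and handed to `subprocess`, which avoids shell quoting problems. This is a minimal sketch: the function names and the timeout are assumptions, and it assumes, as in the example command, that each PDF is stored as `/data/<id>/<id>.pdf`.

```python
import subprocess


def pdf2html_command(doc_id, data_root="/data",
                     image="172.19.79.6:88/fuyun/pdf2htmlex-alpine", zoom=1.3):
    """Build the docker run argument list for converting <doc_id>.pdf."""
    return [
        "docker", "run", "-i", "--rm",
        # Map the document's directory on the host to /pdf in the container.
        "-v", "%s/%s:/pdf" % (data_root, doc_id),
        image,
        "pdf2htmlEX", "--zoom", str(zoom),
        "--split-pages", "1",
        # Literal pattern interpreted by pdf2htmlEX, not by Python.
        "--page-filename", "page_%d.page",
        "%s.pdf" % doc_id,
    ]


def convert(doc_id, timeout=600):
    """Run the one-off conversion container; raises if docker exits non-zero."""
    return subprocess.run(pdf2html_command(doc_id), check=True, timeout=timeout)
```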
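Overall flow
Host docker --> the docker command, libraries, etc. are mapped into the container --> the container calls the docker command to create a container on the host --> the host-side container is one-off and ends its lifecycle automatically once the task finishes. /data is mapped by two containers: the container inside rancher maps it to /pdf and uses it to download PDF documents, saving them on the ceph_rbd volume mounted on the host; the container created on the host also maps it to /pdf, processes the PDFs saved on the host when it starts, and writes the results back to the host directory.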
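After the one-off container exits, the rancher-side container can pick up the per-page files from the shared directory. The sketch below orders them by page number, since a plain string sort would put `page_10.page` before `page_2.page`. The helper names are hypothetical, and it assumes the `page_%d.page` naming from the command above.

```python
import re
from pathlib import Path

PAGE_RE = re.compile(r"^page_(\d+)\.page$")


def sort_pages(names):
    """Return the page_<n>.page names ordered by page number, ignoring other files."""
    numbered = []
    for name in names:
        match = PAGE_RE.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


def collect_pages(doc_dir):
    """List the converted pages found in one document directory, in page order."""
    return sort_pages(p.name for p in Path(doc_dir).iterdir())
```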