tesseract || PDF to PNG to txt

Latest recommended article published on 2024-02-02 10:00:40

weixin_39708708


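A PDF produced by scanning images does not let you copy and paste its text, which is a real nuisance. Some readers and online tools can extract the text from images, but they have to be operated by hand, offline, so they cannot serve as a batch feature inside an online system. This post writes a code module that converts a PDF to images and the images to txt. You can wrap it up to recognize the text in PDFs online.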

About tesseract:

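(1) First, pip install pytesseract;

(2) Next, download tesseract-ocr from https://github.com/UB-Mannheim/tesseract/wiki. Pick the build that matches your system and run the installer. Then edit pytesseract.py so that the executable path it points to (the tesseract_cmd variable) is your installation path;

(3) Finally, add the tessdata folder of your installation, .\Tesseract-OCR\tessdata, to your environment variables;

(4) Other packages needed: fitz and PIL, both installable with pip or conda. Note that the fitz module is provided by the PyMuPDF package and PIL by the Pillow package, so the command is pip install PyMuPDF Pillow;

(5) Language data download (the original GitHub source was unavailable at the time of writing, but a helpful user shared a download package; Simplified Chinese is chi_sim): https://blog.csdn.net/qq_38161040/article/details/90727456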
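Steps (2), (3), and (5) can also be handled from Python instead of editing pytesseract.py. The following is a minimal sketch, not the author's method. The install location `C:\Program Files\Tesseract-OCR` and the helper names `has_language` and `configure_pytesseract` are assumptions for illustration. `pytesseract.pytesseract.tesseract_cmd` is the attribute pytesseract reads to locate the executable, and setting `TESSDATA_PREFIX` to the tessdata folder itself is assumed to suit Tesseract 4 and later.

```python
import os
from pathlib import Path

# Assumed install location; replace it with your own installation path.
TESSERACT_DIR = Path(r"C:\Program Files\Tesseract-OCR")


def has_language(tessdata_dir, lang="chi_sim"):
    """Return True if <lang>.traineddata is present in the tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


def configure_pytesseract(install_dir=TESSERACT_DIR):
    """Point pytesseract at the installed executable without editing pytesseract.py."""
    import pytesseract  # imported here so has_language() works without it

    pytesseract.pytesseract.tesseract_cmd = str(Path(install_dir) / "tesseract.exe")
    # Assumption: Tesseract 4+ expects TESSDATA_PREFIX to be the tessdata folder.
    # This only affects the current process, not the system-wide settings.
    os.environ["TESSDATA_PREFIX"] = str(Path(install_dir) / "tessdata")
```

Calling `configure_pytesseract()` once at start-up and then checking `has_language(TESSERACT_DIR / "tessdata")` gives an early, readable failure if the chi_sim data was never copied into place.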

PDF to PNG images:

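    import datetime

    import fitz  # provided by the PyMuPDF package
    import pytesseract
    from PIL import Image


    def pdf_image(pdfPath, imgPath, zoom_x, zoom_y, rotation_angle):
        '''
        Convert a PDF to PNG images.
        pdfPath: path of the PDF file
        imgPath: folder in which to save the images (must end with a path separator)
        zoom_x: scale factor in the x direction
        zoom_y: scale factor in the y direction
        rotation_angle: rotation angle
        '''
        # Open the PDF file
        pdf = fitz.open(pdfPath)
        # Read the PDF page by page
        # (older PyMuPDF releases spelled the calls below
        #  pageCount / preRotate / getPixmap / writePNG)
        for pg in range(pdf.page_count):
            page = pdf[pg]
            # Set the zoom and rotation factors
            trans = fitz.Matrix(zoom_x, zoom_y).prerotate(rotation_angle)
            pm = page.get_pixmap(matrix=trans, alpha=False)
            # Write the image as 0.png, 1.png, ...
            pm.save(imgPath + str(pg) + ".png")
        pdf.close()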
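The zoom factors decide the resolution of each PNG, which in turn affects OCR accuracy. PDF page sizes are measured in points (72 per inch), so a zoom of 1 renders at 72 dpi and the zoom of 5 passed to pdf_image later in this post corresponds to 360 dpi. The helpers below are a small sketch of that arithmetic. `zoom_for_dpi` and `rendered_size` are hypothetical names, not part of PyMuPDF, and the size calculation ignores rotation.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def zoom_for_dpi(dpi):
    """Zoom factor to pass as zoom_x / zoom_y for a target resolution."""
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt, height_pt, zoom_x, zoom_y):
    """Approximate pixel size of a width_pt x height_pt page after scaling."""
    return round(width_pt * zoom_x), round(height_pt * zoom_y)


print(zoom_for_dpi(360))              # 5.0
print(rendered_size(595, 842, 5, 5))  # (2975, 4210) for an A4-sized page
```

A larger zoom produces bigger files and slower recognition, so a value near 300 dpi (a zoom of about 4.2) is a reasonable starting point to try for scanned text before going higher.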

PNG to txt:

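    def main():
        '''
        Convert PNG images to txt.
        '''
        for i in range(8):  # assuming there are 8 page images: 0.png, 1.png, ...
            starttime = datetime.datetime.now()
            image = Image.open(r"C:/Users/Lenovo/Desktop/" + str(i) + ".png")
            # Recognize the image using the Simplified Chinese language data
            text = pytesseract.image_to_string(image, lang='chi_sim')
            endtime = datetime.datetime.now()
            print("page", i, "took", endtime - starttime)
            text = text.replace(" ", "")
            # Save the recognized text locally ("w" so a rerun does not duplicate it)
            with open(r"C:/Users/Lenovo/Desktop/" + str(i) + ".txt", "w", encoding="utf-8") as f:
                f.write(text)


    if __name__ == "__main__":
        # Placeholder file name: replace with the path of your own PDF
        path = r"C:/Users/Lenovo/Desktop/example.pdf"
        pdf_image(path, r"C:/Users/Lenovo/Desktop/", 5, 5, 0)
        main()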
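The loop in main() assumes exactly 8 pages. One way to drop that assumption, sketched here under stated assumptions, is to list the `N.png` files that pdf_image wrote and sort them by page number, since a plain string sort would put `10.png` before `2.png`. The helper names `numbered_pngs` and `ocr_folder` are illustrative, and `ocr_folder` assumes pytesseract and Pillow are installed as described in the setup steps.

```python
from pathlib import Path


def numbered_pngs(folder):
    """Return the N.png files in folder, ordered by the integer N."""
    pages = [p for p in Path(folder).glob("*.png") if p.stem.isdecimal()]
    return sorted(pages, key=lambda p: int(p.stem))


def ocr_folder(folder, lang="chi_sim"):
    """OCR every numbered page image and write N.txt next to it."""
    import pytesseract      # imported lazily: only needed when OCR actually runs
    from PIL import Image

    for png in numbered_pngs(folder):
        with Image.open(png) as image:
            text = pytesseract.image_to_string(image, lang=lang)
        # Same clean-up as main(): remove the spaces found in the OCR output
        png.with_suffix(".txt").write_text(text.replace(" ", ""), encoding="utf-8")
```

With this in place, `ocr_folder(r"C:/Users/Lenovo/Desktop/")` could stand in for main() regardless of how many pages the PDF has. Files such as `cover.png` are skipped because their names are not page numbers.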

Note: replace the paths with your own PDF path and image folder.

Check the result:

[Figure: pdf]