Python PDF OCR: how to use OCR to efficiently extract text from a directory of PDF files?

Latest recommended article published on 2024-07-09 22:38:42

weixin_39605191


Tags: python pdf ocr

In your code you extract the text, but you never do anything with it.

Try something like this:

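import os
import textract

def extract_txt(file_path):
    # OCR the PDF with tesseract; textract returns the recognized text as bytes.
    text = textract.process(file_path, method='tesseract')
    # Same name as the input, but with a '.txt' extension.
    outfn = os.path.splitext(file_path)[0] + '.txt'
    with open(outfn, 'wb') as output_file:
        output_file.write(text)
    return file_path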

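This writes the text to a file with the same name but a .txt extension.

It also returns the path of the original file, so the parent process knows that this file is done.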
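As a small sketch of the naming rule just described, the helper below derives the .txt path with pathlib rather than by cutting a fixed number of characters, so it does not depend on the extension being exactly four characters long. The name txt_path_for is hypothetical and is not part of the original answer.

```python
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the path of the text file that sits next to `pdf_path`."""
    # with_suffix replaces only the final extension, whatever its length.
    return str(Path(pdf_path).with_suffix('.txt'))
```

Keeping the rule in one function also makes it easy to change later, for example to write all text files into a separate output directory.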

So I would change the mapping code to:

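import multiprocessing

if __name__ == '__main__':  # required on platforms that spawn worker processes
    p = multiprocessing.Pool()
    file_path = ['/Users/user/Desktop/sample.pdf']
    for fn in p.imap_unordered(extract_txt, file_path):
        print('completed file:', fn)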

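>You don't need to pass an argument when creating the pool. By default it creates as many workers as there are CPU cores.

>imap_unordered returns an iterator that starts yielding values as soon as they become available.

>Because the worker function returns the file name, you can print it to let the user know that this file is done.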
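The example above processes a single hard-coded path, while the question is about a whole directory. The following is a minimal sketch of one way to collect the PDFs in a directory and feed them to the pool. The helper names find_pdfs and ocr_directory are hypothetical, and the worker is passed in as a parameter so that any function taking a path (such as extract_txt) can be used.

```python
import multiprocessing
from pathlib import Path


def find_pdfs(directory):
    """Return the sorted paths of all '.pdf' files directly inside `directory`."""
    return sorted(
        str(p) for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == '.pdf'
    )


def ocr_directory(directory, worker, processes=None):
    """Run `worker` on every PDF in `directory`, yielding results as they finish."""
    paths = find_pdfs(directory)
    # processes=None lets the pool default to one worker per CPU core.
    with multiprocessing.Pool(processes) as pool:
        for done in pool.imap_unordered(worker, paths):
            yield done
```

The worker has to be a function defined at module level so that it can be sent to the worker processes, and the loop over ocr_directory(...) should run under an if __name__ == '__main__': guard. Subdirectories are not searched in this sketch.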

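Edit 1:

A follow-up question was whether page boundaries can be marked. I think they can.

One method that is sure to work is to split the PDF file into pages before OCR. You can use e.g. pdfinfo from the poppler-utils package to find the number of pages in the document. Then you can use e.g. pdfseparate from the same poppler-utils package to convert one PDF file of N pages into N PDF files of one page each. You can then OCR the single-page PDF files separately, which gives you the text of each page individually.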
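The splitting step could be sketched as follows, assuming the pdfinfo and pdfseparate commands from poppler-utils are installed and on the PATH. The function names and the output pattern page-%d.pdf are illustrative choices, not something the original answer prescribes.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Extract the page count from the text printed by pdfinfo."""
    match = re.search(r'^Pages:\s+(\d+)\s*$', pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError('no "Pages:" line found in pdfinfo output')
    return int(match.group(1))


def split_into_pages(pdf_path, pattern='page-%d.pdf'):
    """Split `pdf_path` into single-page PDFs and return the page count."""
    info = subprocess.run(['pdfinfo', pdf_path], check=True,
                          capture_output=True, text=True).stdout
    # pdfseparate replaces %d in the pattern with the page number.
    subprocess.run(['pdfseparate', pdf_path, pattern], check=True)
    return parse_page_count(info)
```

Each resulting single-page file can then be passed to the OCR function on its own, so the page a piece of text came from is known from the file name.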

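Alternatively, you can OCR the whole document and then search for the page breaks. This only works if the document has a constant or predictable header or footer on every page. It is probably less reliable than the method above.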
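A minimal sketch of this second approach, assuming every page ends with a footer line of the form "Page N of M". That footer format is a made-up example; a real document needs its own pattern, and any OCR error inside a footer will cause a missed page break.

```python
import re

# Hypothetical footer: a line such as "Page 3 of 10" closing each page.
FOOTER = re.compile(r'^[ \t]*Page[ \t]+\d+[ \t]+of[ \t]+\d+[ \t]*$', re.MULTILINE)


def split_on_footer(text, footer=FOOTER):
    """Split OCR output into per-page chunks at each footer line."""
    pages = [chunk.strip() for chunk in footer.split(text)]
    # Text after the last footer is usually empty; drop empty chunks.
    return [page for page in pages if page]
```

Note that dropping empty chunks also drops completely blank pages, so the index of a chunk in the returned list is not guaranteed to match the printed page number.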

Edit 2:

If you need the text in a file, write a file:

import os

import textract
# PyPDF2 >= 3.0 names; older releases used PdfFileReader/PdfFileWriter.
from PyPDF2 import PdfReader, PdfWriter

def extract_text(pdf_file):
    inputpdf = PdfReader(pdf_file)
    for i in range(len(inputpdf.pages)):
        # Write page i to its own single-page PDF.
        w = PdfWriter()
        w.add_page(inputpdf.pages[i])
        outfname = 'page{:03d}.pdf'.format(i)
        with open(outfname, 'wb') as outfile:  # PDF data is binary, so use 'wb'.
            w.write(outfile)
        print('page', i)
        # textract returns bytes; decode so the text can be joined with str markers.
        text = textract.process(outfname, method='tesseract').decode('utf-8')
        # Add header and footer. The marker wording here is an assumption;
        # use any format that cannot be confused with the page text.
        text = '[page {}]\n'.format(i) + text + '\n[end of page {}]\n'.format(i)
        # Write the OCR-ed text to a file for each page.
        with open('page{:03d}.txt'.format(i), 'w', encoding='utf-8') as textfile:
            textfile.write(text)
        os.remove(outfname)  # clean up.
        print(text)
