Python: Converting PDF to TXT (Without Processing Images)

Latest recommended article published on 2022-04-16 09:26:48

weixin_39790102

Tags: python, PDF image processing

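Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.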

Article link: https://blog.csdn.net/weixin_39790102/article/details/111417589

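The previous post covered a simple Python crawler for downloading documents from web pages. The downloaded files are mostly doc or pdf, though, which is still quite limiting for data processing, so converting doc/pdf to txt becomes important. After a lot of searching, converting doc to txt on Linux turned out to be genuinely difficult, so I decided to start with pdf to txt.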

A senior labmate recommended PDFMiner. I tried it, and the results are good, so I am sharing it here.

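About PDFMiner: "PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data." If you are interested, see the official site for details. PDFMiner ships with a small tool, pdf2txt.py, that converts pdf to txt while still keeping the layout of the text in the PDF. Great!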

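Reading the source of pdf2txt.py shows the concrete steps involved. To make it easier to process PDF files in bulk later, only the pdf-to-txt part is extracted here. The code is as follows: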

# -*- coding: utf-8 -*-
#-----------------------------------------------------
# Purpose:  convert a PDF to TXT (images are not processed)
# Author:   chenbjin
# Date:     2014-07-11
# Language: Python 2.7.6
# Platform: Linux (Ubuntu)
# Requires: PDFMiner 20140328 (must be installed)
# Usage:    python pdf2txt.py file.pdf
#-----------------------------------------------------
import sys

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def main(argv):
    # Output file name; only a single document is handled, so only argv[1] is used
    outfile = argv[1] + '.txt'
    args = [argv[1]]

    debug = 0
    pagenos = set()      # empty set: process all pages
    password = ''
    maxpages = 0         # 0: no page limit
    rotation = 0
    codec = 'utf-8'      # output encoding
    caching = True
    imagewriter = None   # images are not extracted
    laparams = LAParams()

    PDFResourceManager.debug = debug
    PDFPageInterpreter.debug = debug

    rsrcmgr = PDFResourceManager(caching=caching)
    # open() replaces the Python 2-only file() builtin; binary mode is used
    # because TextConverter writes text already encoded with `codec`
    outfp = open(outfile, 'wb')

    # PDF conversion
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams,
                           imagewriter=imagewriter)
    for fname in args:
        fp = open(fname, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process the content of every page in the document
        for page in PDFPage.get_pages(fp, pagenos,
                                      maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            page.rotate = (page.rotate + rotation) % 360
            interpreter.process_page(page)
        fp.close()
    device.close()
    outfp.close()
    return


if __name__ == '__main__':
    main(sys.argv)
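Since the stated goal is to process PDF files in bulk later, here is a minimal sketch of what a batch driver around the script might look like. It is not part of PDFMiner: `convert_all`, `txt_path_for`, and the `convert(pdf_path, txt_path)` callback are hypothetical names introduced for illustration, and the callback is assumed to wrap the conversion steps shown above. The sketch also assumes that base names are unique across subdirectories, because all output files land in a single directory.

```python
import os


def txt_path_for(pdf_path, out_dir):
    """Map 'a/b/report.pdf' to '<out_dir>/report.txt' (hypothetical naming rule)."""
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, base + '.txt')


def convert_all(src_dir, out_dir, convert):
    """Call convert(pdf_path, txt_path) for every .pdf file under src_dir.

    `convert` is supplied by the caller (assumption: it wraps the PDFMiner
    steps from the script above). Returns a (converted, failed) pair.
    """
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    converted, failed = [], []
    for root, _dirs, files in os.walk(src_dir):
        for name in sorted(files):
            if not name.lower().endswith('.pdf'):
                continue  # skip anything that is not a PDF
            pdf_path = os.path.join(root, name)
            txt_path = txt_path_for(pdf_path, out_dir)
            try:
                convert(pdf_path, txt_path)
                converted.append(txt_path)
            except Exception as exc:
                # One unreadable PDF should not abort the whole batch
                failed.append((pdf_path, str(exc)))
    return converted, failed
```

To use it with the script above, one would move the body of `main` into a function that takes the input and output paths and pass that function as `convert`. Collecting failures instead of raising keeps a long run going when a single file is encrypted or malformed.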

The next step is to try converting the images in a PDF as well; see http://denis.papathanasiou.org/2010/08/04/extracting-text-images-from-pdf-files/ for background.

References:

1. PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/
