python - PDFminer - Is there a way to convert pdf into html from pdfminer?

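Summary: The author ran into problems while trying to convert a PDF file to HTML with PDFMiner. They share two code snippets, neither of which works. The first runs but produces no HTML file or any other output; the second fails because `process_pdf` cannot be imported from the `pdfminer.pdfinterp` module. The post covers PDFMiner usage and technical details under Python 3.7.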

Is there a simple way to convert a PDF to HTML using pdfminer?

I have seen many questions like this, but none of them gave me a working answer.

I have entered this in my ConEmu prompt:

# pdf2txt.py -o output.html -t html sample.pdf

usage: C:\Program Files\Python37-32\Scripts\pdf2txt.py [-P password] [-o output] [-t text|html|xml|tag] [-O output_dir] [-c encoding] [-s scale] [-R rotation] [-Y normal|loose|exact] [-p pagenos] [-m maxpages] [-S] [-C] [-n] [-A] [-V] [-M char_margin] [-L line_margin] [-W word_margin] [-F boxes_flow] [-d] input.pdf ...

I hope that is not the response I should get from pdf2txt.py.
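The usage text appears to be what pdf2txt.py prints when it does not see an input file, even though one was typed. One possible cause, which the post does not confirm, is that Windows started the script through its `.py` file association and did not forward the arguments. A minimal sketch of a workaround is to launch the script through the Python interpreter explicitly. The helper names `build_pdf2txt_command` and `run_pdf2txt` are hypothetical; the flags are the ones from the command above.

```python
import subprocess
import sys


def build_pdf2txt_command(script_path, pdf_path, html_path):
    """Return an argv list that runs pdf2txt.py through the current interpreter."""
    # Passing the script to the interpreter avoids relying on the
    # Windows .py file association to forward the arguments.
    return [sys.executable, script_path, "-o", html_path, "-t", "html", pdf_path]


def run_pdf2txt(script_path, pdf_path, html_path):
    """Run the command and return the exit code of pdf2txt.py."""
    command = build_pdf2txt_command(script_path, pdf_path, html_path)
    return subprocess.run(command).returncode
```

For example, `run_pdf2txt(r"C:\Program Files\Python37-32\Scripts\pdf2txt.py", "sample.pdf", "output.html")` mirrors the command typed at the prompt.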

Is there any code snippet that will work?

I have tried this:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO

def convert_pdf_to_html(path):
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = HTMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # is for all
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str

test = convert_pdf_to_html('E://sample.pdf')

But it gave me neither an HTML file nor any output.
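In the snippet above, `convert_pdf_to_html` returns the HTML as bytes, and the last line only stores that value in `test`. Nothing is printed or written to disk, which would explain the silence. A minimal sketch of the missing step follows; the helper name `save_html` and the output path in the usage note are hypothetical.

```python
from pathlib import Path


def save_html(html_bytes: bytes, out_path: str) -> int:
    """Write the converter's byte output to out_path.

    Returns the number of bytes written, which is a quick way to see
    whether the conversion produced anything at all.
    """
    return Path(out_path).write_bytes(html_bytes)
```

With the function from the question, this could be used as `save_html(convert_pdf_to_html('E://sample.pdf'), 'E://sample.html')`.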

I also tried this code:

import pdfminer
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import HTMLConverter, TextConverter
from pdfminer.layout import LAParams

rsrcmgr = PDFResourceManager()
laparams = LAParams()
converter = HTMLConverter if format == 'html' else TextConverter
device = converter(rsrcmgr, out_file, codec='utf-8', laparams=laparams)
process_pdf(rsrcmgr, device, in_file, pagenos=[1,3,5], maxpages=9)

with contextlib.closing(tempfile.NamedTemporaryFile(mode='r', suffix='.xml')) as xmlin:
    cmd = 'pdftohtml -xml -nodrm -zoom 1.5 -enc UTF-8 -noframes "%s" "%s"' % (
        pdf_filename, xmlin.name.rpartition('.')[0])
    os.system(cmd + " >/dev/null 2>&1")
    result = xmlin.read().decode('utf-8')

It gives this:

Traceback (most recent call last):
  File "E:\Blah\blah\blah.py", line 2, in <module>
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf
ImportError: cannot import name 'process_pdf' from 'pdfminer.pdfinterp' (c:\program files\python37-32\lib\site-packages\pdfminer\pdfinterp.py)
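The traceback shows that the installed `pdfminer.pdfinterp` does not export `process_pdf`. The first snippet in this post already drives the pages another way, with `PDFPage.get_pages` and `PDFPageInterpreter.process_page`. The sketch below wraps that loop in a function with the same call shape as the failing line. It is a local stand-in written for illustration, not part of the library, and the returned page count is an addition for convenience.

```python
def process_pdf(rsrcmgr, device, fp, pagenos=None, maxpages=0, **kwargs):
    """Local stand-in for the missing helper: feed each page to the device.

    Extra keyword arguments (for example password or caching) are passed
    through to PDFPage.get_pages unchanged.
    """
    # Resolved from the installed pdfminer when the function is called.
    from pdfminer.pdfinterp import PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    count = 0
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, **kwargs):
        interpreter.process_page(page)
        count += 1
    return count
```

With this definition in the script and `process_pdf` removed from the import line, the call `process_pdf(rsrcmgr, device, in_file, pagenos=[1,3,5], maxpages=9)` can stay as written, assuming `in_file` is the PDF opened in binary mode and `out_file` is a binary output stream.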

Info:

System : Windows 7 SP-1 32-bit

Python : 3.7.0

PDFMiner : 20191125
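Since the system is Windows, the `>/dev/null 2>&1` redirect in the second snippet is worth a second look: it is a Unix shell idiom, and `cmd.exe` has no `/dev/null`. A hedged sketch of the same `pdftohtml` call through `subprocess` follows. It assumes `pdftohtml` is installed and on `PATH`, which the post does not state; the flags are copied from the snippet, and the helper names are hypothetical.

```python
import subprocess


def build_pdftohtml_command(pdf_filename, out_base):
    """Return the argv list for the pdftohtml call from the second snippet."""
    # A list avoids manual quoting of paths that contain spaces.
    return ["pdftohtml", "-xml", "-nodrm", "-zoom", "1.5",
            "-enc", "UTF-8", "-noframes", pdf_filename, out_base]


def run_pdftohtml(pdf_filename, out_base):
    """Run pdftohtml with its output discarded and return the exit code."""
    completed = subprocess.run(
        build_pdftohtml_command(pdf_filename, out_base),
        stdout=subprocess.DEVNULL,  # portable replacement for >/dev/null
        stderr=subprocess.DEVNULL,  # portable replacement for 2>&1
    )
    return completed.returncode
```

Checking the returned exit code also makes a failed conversion visible, which `os.system` with discarded output tends to hide.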
