Converting PDF to HTML with pdfminer: [Record] Trying to use pyPdf to convert a non-copyable PDF to text or HTML

[Background]

While tinkering with:

along the way, I tried using pyPdf to convert a non-copyable PDF file to text or HTML.

[Process]

1. Following:

I found:

and downloaded:

2. But the installer could not find Python:

73293fe1d1666f30910866e246a6b145.png

It seems that:

the x64 Python I have installed here is not recognized by this installer...
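Before blaming the installer, it can help to confirm which build of Python is actually on the PATH. The snippet below is a minimal sketch using only the standard library; `python_bitness` is a hypothetical helper name, not something from pyPdf or its installer.

```python
import platform
import struct
import sys


def python_bitness():
    """Return the pointer size of the running interpreter in bits (32 or 64)."""
    return struct.calcsize("P") * 8


# Report the interpreter version, its bitness, and what platform reports.
print("Python %s, %d-bit (%s)" % (
    sys.version.split()[0],
    python_bitness(),
    platform.architecture()[0],
))
```

If this reports 64-bit while the installer was built for 32-bit Python, that mismatch would be consistent with the installer not seeing the interpreter, and installing from the source archive with `setup.py` avoids the issue.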

3. Downloaded it again:

then extracted and installed it:

D:\tmp\dev_tools\python\pdf\pyPdf-1.13\pyPdf-1.13>python setup.py install
running install
running build
running build_py
creating build
creating build\lib
creating build\lib\pyPdf
copying pyPdf\filters.py -> build\lib\pyPdf
copying pyPdf\generic.py -> build\lib\pyPdf
copying pyPdf\pdf.py -> build\lib\pyPdf
copying pyPdf\utils.py -> build\lib\pyPdf
copying pyPdf\xmp.py -> build\lib\pyPdf
copying pyPdf\__init__.py -> build\lib\pyPdf
running install_lib
creating D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\filters.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\generic.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\pdf.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\utils.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\xmp.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\__init__.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\filters.py to filters.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\generic.py to generic.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\pdf.py to pdf.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\utils.py to utils.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\xmp.py to xmp.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\__init__.py to __init__.pyc
running install_egg_info
Writing D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf-1.13-py2.7.egg-info

Then I tried it out with this script (Python 2.7, pyPdf 1.13):

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
[Unsolved] Export table data from a non-copyable PDF and convert it to XML
https://www.crifan.com/non_copy_pdf_table_data_export_to_xml

Author: Crifan Li
Version: 2014-01-26
Contact: https://www.crifan.com/about/me
"""

import os
import glob
from pyPdf import PdfFileReader

def pdf_table_to_xml():
    """Operate PDF file, extract table data, save to xml"""
    parent = "D:/tmp/tmp_dev_root/python/answer_question/self/pdf_table_to_xml/pdf"
    os.chdir(parent)  # work inside the folder that holds the PDF
    pdfFilename = "spec183r21.0.pdf"
    filename = os.path.abspath(pdfFilename)
    input = PdfFileReader(open(filename, "rb"))  # PDFs must be opened in binary mode

    for page in input.pages:  # for now, just dump the text of every page
        print page.extractText()

if __name__ == "__main__":
    pdf_table_to_xml()

Running it failed with an error saying the file has not been decrypted:

D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml>pdf_table_to_xml.py
Traceback (most recent call last):
  File "D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf_table_to_xml.py", line 29, in <module>
    pdf_table_to_xml();
  File "D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf_table_to_xml.py", line 25, in pdf_table_to_xml
    for page in input.pages:
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\utils.py", line 78, in __getitem__
    len_self = len(self)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\utils.py", line 73, in __len__
    return self.lengthFunction()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\pdf.py", line 431, in getNumPages
    self._flatten()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\pdf.py", line 596, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\generic.py", line 480, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\generic.py", line 165, in getObject
    return self.pdf.getObject(self).getObject()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\pdf.py", line 655, in getObject
    raise Exception, "file has not been decrypted"
Exception: file has not been decrypted
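The error suggests the PDF is encrypted, which would also explain why its text cannot be copied. The PDF format marks an encrypted document with an `/Encrypt` entry in its trailer dictionary, so a rough check is possible without any PDF library. The sketch below is only a heuristic under that assumption; `looks_encrypted` and `file_looks_encrypted` are hypothetical helper names, and a match merely hints that decryption is needed.

```python
def looks_encrypted(pdf_bytes):
    """Heuristic: True if the raw PDF bytes contain an /Encrypt key."""
    return b"/Encrypt" in pdf_bytes


def file_looks_encrypted(path):
    """Read a PDF file in binary mode and apply the heuristic above."""
    with open(path, "rb") as handle:
        return looks_encrypted(handle.read())
```

For the file used here, that would mean calling `file_looks_encrypted("spec183r21.0.pdf")` from the folder that contains it.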

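4. Then I tried to solve the problem above:

and found no solution.

Among the search results:

one says the code works fine for other PDFs, so the bug was simply ignored...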
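One thing that was not tried here is asking the reader to decrypt the file before touching its pages. Copy-restricted PDFs are often encrypted with an empty user password, so trying `""` is a common first step. The sketch below assumes the reader object behaves like pyPdf's `PdfFileReader`, with an `isEncrypted` attribute and a `decrypt(password)` method that returns 0 when the password does not match. `try_unlock` is a hypothetical helper, and whether it works for this particular file is untested.

```python
def try_unlock(reader, passwords=("",)):
    """Try to make an encrypted reader usable; return True on success.

    `reader` is assumed to expose `isEncrypted` and `decrypt(password)`
    in the style of pyPdf's PdfFileReader (an assumption, not verified
    against the PDF from this post).
    """
    if not reader.isEncrypted:
        return True  # nothing to unlock
    for password in passwords:
        try:
            # A non-zero return value means the password was accepted.
            if reader.decrypt(password):
                return True
        except NotImplementedError:
            # Assumed to signal an encryption scheme the library cannot handle.
            return False
    return False
```

In the script above, this would be called as `try_unlock(input)` right after creating the `PdfFileReader` and before iterating over `input.pages`. If it returns `False`, the file needs a real password or uses an encryption scheme the library does not support.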

[Summary]

For now, pyPdf cannot convert the non-copyable PDF above into the desired text or HTML either.
