【Background】
While working on:
I wanted to try using pyPdf to convert a PDF file whose content cannot be copied into text or HTML.
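【Process】
1. Following the reference:
I found:
and downloaded:
2. But the installer could not find Python:
It seems that:
I have the x64 build of Python installed here, and the installer does not recognize it...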
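To confirm which build of Python is actually installed before picking an installer, a small check like the following can help. This is only a sketch; the helper names `python_bitness` and `describe_python` are made up for illustration and are not part of pyPdf or of the original post.

```python
import platform
import struct
import sys


def python_bitness():
    """Return 32 or 64, based on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8


def describe_python():
    """Return a one-line description: version, bitness and executable path."""
    return "%s (%d-bit) at %s" % (
        platform.python_version(),
        python_bitness(),
        sys.executable,
    )
```

Calling `describe_python()` from the interpreter that should receive the package shows whether it is the x64 build that the installer failed to detect.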
3. Downloaded again (the source package this time):
Then unpacked it and installed it:
D:\tmp\dev_tools\python\pdf\pyPdf-1.13\pyPdf-1.13>python setup.py install
running install
running build
running build_py
creating build
creating build\lib
creating build\lib\pyPdf
copying pyPdf\filters.py -> build\lib\pyPdf
copying pyPdf\generic.py -> build\lib\pyPdf
copying pyPdf\pdf.py -> build\lib\pyPdf
copying pyPdf\utils.py -> build\lib\pyPdf
copying pyPdf\xmp.py -> build\lib\pyPdf
copying pyPdf\__init__.py -> build\lib\pyPdf
running install_lib
creating D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\filters.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\generic.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\pdf.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\utils.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\xmp.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\__init__.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\filters.py to filters.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\generic.py to generic.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\pdf.py to pdf.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\utils.py to utils.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\xmp.py to xmp.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\__init__.py to __init__.pyc
running install_egg_info
Writing D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf-1.13-py2.7.egg-info
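After an install log like the one above, it can be worth checking that the package is importable and where it was loaded from. A minimal sketch, assuming nothing beyond the standard import mechanism (the helper name `module_location` is hypothetical):

```python
def module_location(name):
    """Return the file a top-level module was loaded from, or None if it cannot be imported."""
    try:
        module = __import__(name)
    except ImportError:
        return None
    return getattr(module, "__file__", None)
```

With the install above, `module_location("pyPdf")` would be expected to point somewhere under `Python27_x64\Lib\site-packages\pyPdf`; a result of `None` would mean the package did not end up on the import path of the interpreter being used.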
Then I tried it out:
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
[Unresolved] Export table data from a non-copyable PDF and convert it to XML
https://www.crifan.com/non_copy_pdf_table_data_export_to_xml
Author: Crifan Li
Version: 2014-01-26
Contact: https://www.crifan.com/about/me
"""

import os
import glob
from pyPdf import PdfFileReader

def pdf_table_to_xml():
    """Operate PDF file, extract table data, save to xml"""
    parent = "D:/tmp/tmp_dev_root/python/answer_question/self/pdf_table_to_xml/pdf"
    os.chdir(parent)

    pdfFilename = "spec183r21.0.pdf"
    filename = os.path.abspath(pdfFilename)
    input = PdfFileReader(file(filename, "rb"))  # Python 2: file() opens the PDF in binary mode

    for page in input.pages:  # iterating the pages is where the exception is raised
        print page.extractText()

if __name__ == "__main__":
    pdf_table_to_xml();
Running it failed, with an error saying the file has not been decrypted:
D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml>pdf_table_to_xml.py
Traceback (most recent call last):
  File "D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf_table_to_xml.py", line 29, in <module>
    pdf_table_to_xml();
  File "D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf_table_to_xml.py", line 25, in pdf_table_to_xml
    for page in input.pages:
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\utils.py", line 78, in __getitem__
    len_self = len(self)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\utils.py", line 73, in __len__
    return self.lengthFunction()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\pdf.py", line 431, in getNumPages
    self._flatten()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\pdf.py", line 596, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\generic.py", line 480, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\generic.py", line 165, in getObject
    return self.pdf.getObject(self).getObject()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\pdf.py", line 655, in getObject
    raise Exception, "file has not been decrypted"
Exception: file has not been decrypted
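pyPdf raises this exception when the file is marked as encrypted and `decrypt()` has not been called on the reader. One thing that is commonly tried in this situation is decrypting with an empty password before reading the pages. The sketch below was not run against spec183r21.0.pdf and may well still fail for it. The helper name `extract_pages_text` is made up for illustration; it only assumes that the reader behaves like pyPdf's `PdfFileReader` (an `isEncrypted` attribute, a `decrypt(password)` method that returns 0 on failure, and a `pages` sequence whose items have `extractText()`).

```python
def extract_pages_text(reader, password=""):
    """Return a list with the text of each page, or None if decryption fails.

    `reader` is assumed to look like pyPdf's PdfFileReader.
    """
    if getattr(reader, "isEncrypted", False):
        try:
            result = reader.decrypt(password)
        except NotImplementedError:
            # Assumption: pyPdf signals an unsupported encryption scheme this way.
            return None
        if not result:
            # 0 means the password did not match.
            return None
    return [page.extractText() for page in reader.pages]


# Hypothetical usage with pyPdf on Python 2:
#   from pyPdf import PdfFileReader
#   reader = PdfFileReader(open("spec183r21.0.pdf", "rb"))
#   texts = extract_pages_text(reader)
```

A return value of `None` means the empty password was rejected or the encryption scheme is not handled, in which case this route does not help either.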
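4. Then I tried to solve the problem above:
I did not find a solution.
Among the results:
it is said that the code works fine for other PDFs, so this bug is simply ignored...
【Summary】
For now, pyPdf cannot convert the non-copyable PDF above into the text or HTML I wanted either.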