PDFMiner, a handy Python tool: converting PDF to TXT in Python (with code)

Latest recommended article published on 2023-09-22 13:24:44

Mrchesian


Category: python. Tags: python, pdf, PDF parsing


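PDFMiner's features include:
1. Written entirely in Python (for version 2.4 or newer).
2. Parses, analyzes, and converts PDF documents.
3. Support for the PDF-1.7 specification (almost complete).
4. Support for CJK (Chinese, Japanese, Korean) languages and vertical writing scripts.
5. Support for various font types (Type1, TrueType, Type3, and CID).
6. Support for basic encryption (RC4).
7. PDF to HTML conversion.
8. Outline (TOC) extraction.
9. Tagged contents extraction.
10. Reconstruction of the original layout by grouping text chunks.

If pip is installed for your Python, you can install pdfminer automatically with the pip command. (An installation done this way does not support Chinese text; for that, follow the CJK build steps further down.)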

# pip install pdfminer

The official pdfminer resources are listed below, followed by the installation steps from its documentation:

Online Demo: (pdf -> html conversion webapp)
http://pdf2html.tabesugi.net:8080/
Source distribution:
http://pypi.python.org/pypi/pdfminer/
github:
https://github.com/euske/pdfminer/

Install Python 2.4 or newer. (Python 3 is not supported.)
Download the PDFMiner source.
Unpack it.
Run setup.py to install:

# python setup.py install

Do the following test:

$ pdf2txt.py samples/simple1.pdf
Hello
World
Hello
World
H e l l o
W o r l d
H e l l o
W o r l d
Done!
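
Besides running pdf2txt.py, the installation can also be checked from Python itself by trying to import the package. This is only a minimal sketch: the helper name `pdfminer_is_installed` is hypothetical and is not part of PDFMiner.

```python
def pdfminer_is_installed():
    """Return True if the pdfminer package can be imported, False otherwise."""
    try:
        import pdfminer  # noqa: F401  (imported only to test availability)
    except ImportError:
        return False
    return True


print("pdfminer installed: %s" % pdfminer_is_installed())
```

A successful import only shows that the package is on the import path; it does not show that the CJK CMap data has been built.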

To process CJK languages, you need an additional step during installation: build the CMap data first, then install.

On Linux, simply run make:
# make cmap
python tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt
reading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...
writing 'CNS1_H.py'...
…
(this may take several minutes)

# python setup.py install

On Windows machines, which don't have the make command, paste the following commands at a command prompt:

mkdir pdfminer\cmap
python tools\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt
python tools\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt
python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt
python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
python setup.py install
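
The four conv_cmap.py invocations above differ only in the character collection and its encoding options, so they can be generated from a small table instead of being pasted by hand. The following is a rough sketch under that assumption: the names `CMAP_JOBS`, `build_cmap_commands`, and `build_cmaps` are hypothetical, and the arguments are copied from the Windows commands above. It should be run from the root of the unpacked PDFMiner source.

```python
import os
import subprocess
import sys

# (character collection, [(CMap encoding name, Python codec), ...]),
# copied from the Windows commands shown above.
CMAP_JOBS = [
    ("Adobe-CNS1", [("B5", "cp950"), ("UniCNS-UTF8", "utf-8")]),
    ("Adobe-GB1", [("GBK-EUC", "cp936"), ("UniGB-UTF8", "utf-8")]),
    ("Adobe-Japan1", [("RKSJ", "cp932"), ("EUC", "euc-jp"),
                      ("UniJIS-UTF8", "utf-8")]),
    ("Adobe-Korea1", [("KSC-EUC", "euc-kr"), ("KSC-Johab", "johab"),
                      ("KSCms-UHC", "cp949"), ("UniKS-UTF8", "utf-8")]),
]


def build_cmap_commands(python=sys.executable):
    """Return one argument list per CJK character collection."""
    out_dir = os.path.join("pdfminer", "cmap")
    script = os.path.join("tools", "conv_cmap.py")
    commands = []
    for collection, encodings in CMAP_JOBS:
        argv = [python, script]
        for cmap_name, codec in encodings:
            argv += ["-c", "%s=%s" % (cmap_name, codec)]
        # e.g. Adobe-CNS1 -> cmaprsrc/cid2code_Adobe_CNS1.txt
        source = os.path.join(
            "cmaprsrc", "cid2code_%s.txt" % collection.replace("-", "_"))
        argv += [out_dir, collection, source]
        commands.append(argv)
    return commands


def build_cmaps():
    """Create pdfminer/cmap if needed, then run every conversion in turn."""
    out_dir = os.path.join("pdfminer", "cmap")
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    for argv in build_cmap_commands():
        subprocess.check_call(argv)  # raises if a conversion fails
```

Calling `build_cmaps()` and then running `python setup.py install` corresponds to the manual sequence above; `os.path.join` is used so the same sketch produces the right path separators on Windows and Linux.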

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2017/7/6 21:02
# @Author  : chen
# @Site    : 
# @File    : simplePDF.py
# @Software: PyCharm
from cStringIO import StringIO  # cStringIO exists only in Python 2

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_2_text(path):
    """Return the text of every page of the PDF at `path`, encoded as UTF-8."""
    rsrcmgr = PDFResourceManager()  # holds shared resources such as fonts
    retstr = StringIO()             # in-memory buffer that receives the text
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    try:
        with open(path, 'rb') as fp:
            # An empty page-number set means "process all pages".
            for page in PDFPage.get_pages(fp, set()):
                interpreter.process_page(page)
        text = retstr.getvalue()
    finally:
        # Release the converter and the buffer even if parsing fails.
        device.close()
        retstr.close()
    return text
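
To go from the extracted string to actual TXT files, one possible approach is a small wrapper that walks a directory and writes one .txt file per PDF. This is only a sketch: `convert_directory` and its parameters are hypothetical names, and the extraction function is passed in as an argument (for example `convert_pdf_2_text` from the listing above), so the wrapper does not depend on PDFMiner itself.

```python
import os


def convert_directory(src_dir, dst_dir, convert):
    """Convert every *.pdf in src_dir to a .txt file in dst_dir.

    `convert` is any callable that takes a PDF path and returns its text.
    Returns the list of .txt paths that were written.
    """
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    written = []
    for name in sorted(os.listdir(src_dir)):
        base, ext = os.path.splitext(name)
        if ext.lower() != ".pdf":
            continue  # skip non-PDF files and subdirectories
        text = convert(os.path.join(src_dir, name))
        # The converter may return encoded bytes or a text string;
        # normalize to UTF-8 bytes before writing.
        if not isinstance(text, bytes):
            text = text.encode("utf-8")
        out_path = os.path.join(dst_dir, base + ".txt")
        with open(out_path, "wb") as out:
            out.write(text)
        written.append(out_path)
    return written
```

With the function from the listing, a call such as `convert_directory("pdfs", "txt", convert_pdf_2_text)` would write one text file per PDF; the directory names here are placeholders.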