Understanding the pdfminer library: using pdfminer for information extraction

stevenjhjh

Tags: python

Article link: https://blog.csdn.net/stevenjhjh/article/details/106804131

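Parsing with pdfminer
To start, here is how the official pdfminer documentation explains it, mainly through three figures.
One figure shows the relationships among pdfminer's classes: PDFParser first parses the file, and a PDFDocument is then created and associated with that PDFParser.
Another figure describes the contents of the LTPage produced by parsing. An LTPage contains the text blocks that were recognized (note that blocks are identified by spatial position, not by logical structure). One LTPage contains multiple LTTextBox text blocks, each text block contains multiple LTTextLine text lines, and each line is made up of individual characters. The meaning of each type is as follows.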

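Type	Meaning
LTPage	Represents an entire page
LTTextBox	Represents a group of text lines contained in a rectangular area
LTTextLine	Represents a single line of text, made up of multiple LTChar objects
LTChar / LTAnno	LTChar represents an actual character in the text as a Unicode string; LTAnno is a virtual character (such as a space or newline) inserted by the layout analyzer
LTFigure	Represents an area used by a PDF Form object
LTImage	Represents an image object, which may be JPEG or another format
LTRect	Represents a rectangle, which may be used to frame or separate other content
LTLine	Represents a single straight line, which may be used to separate text or figures
LTCurve	Represents a generic Bezier curve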
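
The containment hierarchy in the table can be explored with a small recursive walk. The following is a minimal sketch, not part of the original article: `walk_layout` and `print_layout_tree` are hypothetical helper names, and `print_layout_tree` assumes the pdfminer.six distribution, whose `pdfminer.high_level.extract_pages` yields one LTPage per page. The walk relies only on the fact that container objects (LTPage, LTTextBox, LTTextLine, LTFigure) are iterable while leaf objects (LTChar, LTAnno, LTLine, LTRect and so on) are not.

```python
from typing import Any, Iterator, Tuple


def walk_layout(obj: Any, depth: int = 0) -> Iterator[Tuple[int, Any]]:
    """Yield (depth, object) for obj and for everything nested inside it."""
    yield depth, obj
    # Strings are iterable too, but they are never layout containers.
    if isinstance(obj, (str, bytes)):
        return
    try:
        children = iter(obj)
    except TypeError:
        # Leaf objects are not iterable, so the recursion stops here.
        return
    for child in children:
        yield from walk_layout(child, depth + 1)


def print_layout_tree(pdf_path: str) -> None:
    """Print the type name of every layout object, indented by nesting depth."""
    # Imported lazily; assumes pdfminer.six is installed.
    from pdfminer.high_level import extract_pages

    for page in extract_pages(pdf_path):
        for depth, obj in walk_layout(page):
            print("  " * depth + type(obj).__name__)


# Stand-in classes (hypothetical) so the walk can be tried without a PDF file.
class _Leaf:
    pass


class _Box(list):
    pass


demo_tree = _Box([_Box([_Leaf(), _Leaf()]), _Leaf()])
demo_outline = [(depth, type(obj).__name__) for depth, obj in walk_layout(demo_tree)]
```

Calling `print_layout_tree('mypdf.pdf')` on a real file should print an indented outline of type names, which makes it easy to see which of the types above actually occur in a given document.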

Basic code usage

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
# PDFTextExtractionNotAllowed is defined in pdfminer.pdfdocument.
from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice


# Password for an encrypted PDF; an empty string is assumed for an unencrypted file.
password = ''

# Open a PDF file. Keep it open while pages are being processed,
# and call fp.close() once all processing is finished.
fp = open('mypdf.pdf', 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, password)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)


from pdfminer.layout import LAParams, LTTextBox
from pdfminer.converter import PDFPageAggregator


# Set parameters for analysis.
laparams = LAParams()
# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    # Receive the LTPage object for the page.
    layout = device.get_result()  # layout is the LTPage described above
    # An LTPage can be iterated but not indexed; copy its children
    # into a list if access by index is needed.
    items = list(layout)
    for x in items:
        if isinstance(x, LTTextBox):
            # Pick out the objects you need and handle them here,
            # for example by reading x.get_text().
            pass
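
Because text blocks are identified by position rather than by logical structure, it can help to put them into an explicit order before further processing. The following is a minimal sketch under stated assumptions, not the article's own method: `sort_top_to_bottom` and `page_text_blocks` are hypothetical helper names, the top-then-left ordering is only a simple heuristic, and `page_text_blocks` assumes the pdfminer.six distribution with its `pdfminer.high_level.extract_pages` helper. It relies on layout objects exposing `x0` and `y1` bounding-box attributes, with the PDF origin at the bottom-left corner of the page.

```python
from collections import namedtuple
from typing import Iterable, List


def sort_top_to_bottom(boxes: Iterable) -> List:
    """Order objects that expose x0 and y1: top of the page first, then left to right."""
    # The PDF origin is the bottom-left corner, so a larger y1 is higher on the page.
    return sorted(boxes, key=lambda box: (-box.y1, box.x0))


def page_text_blocks(pdf_path: str) -> List[List[str]]:
    """Return, for each page, the text of its LTTextBox objects from top to bottom."""
    # Imported lazily; assumes pdfminer.six is installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBox

    pages = []
    for layout in extract_pages(pdf_path):
        # The LTPage is iterable, so a comprehension collects its text boxes.
        boxes = [obj for obj in layout if isinstance(obj, LTTextBox)]
        pages.append([box.get_text() for box in sort_top_to_bottom(boxes)])
    return pages


# Stand-in objects (hypothetical) to show the ordering without a PDF file.
FakeBox = namedtuple("FakeBox", "x0 y1 text")
demo_boxes = [
    FakeBox(50, 100, "footer"),
    FakeBox(300, 700, "right"),
    FakeBox(50, 700, "left"),
]
demo_order = [box.text for box in sort_top_to_bottom(demo_boxes)]
```

For multi-column pages this ordering interleaves the columns, so a real extraction task may need a different key, for example sorting by column first.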
