Extracting text lines and their coordinates from PDFs with pdfplumber and pdfminer.six

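Introduction
  • I have recently been working on PDF parsing. For PDFs whose text can be selected and copied in a PDF reader, the same text can also be extracted directly in code.
  • After some research, I found that the two most widely used libraries are pdfplumber and pdfminer.six.
  • The sections below show how to use each library to extract text lines and their coordinates from a PDF.
  • Download link for the sample PDF file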
Extracting with pdfplumber
  • Official repo: jsvine/pdfplumber
  • The documentation is the README file in that repository
  • Code:
    import pdfplumber
    
    pdf_path = 'hung2019.pdf'
    with pdfplumber.open(pdf_path) as pdf:
        first_page = pdf.pages[0]
        result = first_page.extract_words(x_tolerance=1, keep_blank_chars=True)
        for value in result:
            print(value['text'])
    
  • Partial output:
    Malware
    detection
    based
    on
    directed
    multi-edge
    dataflow
    graph
    representation
    and
    convolutional
    neural
    network
    
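  • As the output shows, even with keep_blank_chars=True the text comes back word by word rather than line by line. Other parameters can be tuned as well, such as x_tolerance and y_tolerance. I tried many combinations, but none of them produced line-level output.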
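One possible workaround, offered only as a rough sketch and not as part of pdfplumber's API, is to group the word dictionaries returned by extract_words() into lines yourself. Each word dict carries `x0`, `x1`, `top` and `bottom` keys. The helper name `group_words_into_lines`, the 3-point `y_tolerance` default and the sample word values are all assumptions made for illustration.

```python
def group_words_into_lines(words, y_tolerance=3.0):
    """Group pdfplumber-style word dicts into text lines.

    Hypothetical helper: a word joins the current line when its `top`
    is within `y_tolerance` of the first word of that line.
    Returns a list of (text, (x0, top, x1, bottom)) tuples.
    """
    lines = []
    # Visit words top-to-bottom, then left-to-right.
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])

    result = []
    for line in lines:
        line.sort(key=lambda w: w["x0"])
        text = " ".join(w["text"] for w in line)
        # The line's box is the union of its words' boxes.
        bbox = (
            min(w["x0"] for w in line),
            min(w["top"] for w in line),
            max(w["x1"] for w in line),
            max(w["bottom"] for w in line),
        )
        result.append((text, bbox))
    return result


# Made-up word dicts in the shape returned by page.extract_words()
sample_words = [
    {"text": "detection", "x0": 150.0, "x1": 240.0, "top": 54.2, "bottom": 78.0},
    {"text": "Malware", "x0": 69.5, "x1": 140.0, "top": 54.0, "bottom": 78.0},
    {"text": "dataflow", "x0": 71.1, "x1": 160.0, "top": 82.0, "bottom": 105.8},
]
for text, bbox in group_words_into_lines(sample_words):
    print(text, bbox)
```

With a real page, the same helper could be called as group_words_into_lines(first_page.extract_words()). Whether a fixed tolerance works will depend on the fonts and layout of the document.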
Extracting with pdfminer.six
Installation
pip install pdfminer.six
pip install pdf2image
Runtime environment versions
pdf2image                  1.16.3
pdfminer.six               20220524
pdfplumber                 0.7.5
python                     3.10.13
The more verbose version
  • Code:
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine
    from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser
    
    pdf_path = 'hung2019.pdf'
    
    with open(pdf_path, 'rb') as f:
        # Create a PDF parser for the file
        parser = PDFParser(f)
    
        # Create a PDF document object that stores the document structure
        document = PDFDocument(parser)
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed(pdf_path)
    
        # Create a PDF resource manager to store shared resources
        rsrcmgr = PDFResourceManager()
    
        # Layout analysis parameters
        laparams = LAParams()
    
        # Create a PDF device object
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)
    
        # Process each page
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
    
            # Receive the LTPage object for this page
            layout = device.get_result()
            page_height = layout.bbox[3]
    
            for x in layout:
                if isinstance(x, LTTextBox):
                    for v in x:
                        if isinstance(v, LTTextLine):
                            text = v.get_text()
                            x0, y0, x1, y1 = v.bbox
    
                            # The bbox y values are measured from the bottom of the
                            # page; subtract them from the page height to get
                            # coordinates with the origin at the top-left corner
                            y0 = page_height - y0
                            y1 = page_height - y1
                            print(f'{text}\t({x0}, {y0}, {x1}, {y1})')
    
  • Partial output:
    Malware detection based on directed multi-edge
    	(69.517, 77.95079429999998, 542.4866443, 54.04049429999998)
    dataflow graph representation and convolutional
    	(71.142, 105.84579429999997, 540.8598435, 81.93549429999996)
    neural network
    	(232.883, 133.74179430000004, 379.11839480000003, 109.83149430000003)
    Nguyen Viet Hung
    	(105.702, 161.48445089999996, 190.63347499999998, 150.52555089999998)
    Le Quy Don Techincal University
    	(81.264, 173.5719019999999, 218.55859060000006, 163.60930199999996)
    Faculty of Information Technology
    	(76.013, 185.899902, 216.83435100000005, 175.93730200000005)
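
The y-axis flip used in this code can be isolated into a small helper. This is a minimal sketch; the name `to_top_left_origin` is hypothetical and not part of pdfminer.six, and the numbers are made up. It takes a bbox in pdfminer's (x0, y0, x1, y1) order, measured from the bottom-left corner of the page, and returns (left, top, right, bottom) measured from the top-left corner, so that top is smaller than bottom.

```python
def to_top_left_origin(bbox, page_height):
    """Convert a bottom-left-origin bbox to a top-left-origin bbox.

    Hypothetical helper. Input order: (x0, y0, x1, y1) with y0 < y1.
    Output order: (left, top, right, bottom) with top < bottom.
    """
    x0, y0, x1, y1 = bbox
    # The upper edge (y1) becomes `top`, the lower edge (y0) becomes `bottom`.
    return (x0, page_height - y1, x1, page_height - y0)


# Made-up box on a 792-point-high page
print(to_top_left_origin((100.0, 700.0, 300.0, 720.0), page_height=792.0))
# → (100.0, 72.0, 300.0, 92.0)
```

Note that the output listed above prints the flipped lower edge before the flipped upper edge, whereas this helper returns them in top, bottom order, which is the order most image libraries expect.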
    
The high-level API version
  • Code:
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTPage, LTTextBoxHorizontal, LTTextLineHorizontal
    
    pdf_path = 'hung2019.pdf'
    pages = list(extract_pages(pdf_path))
    
    # Example: take the first page
    page = pages[0]
    boxes, texts = [], []
    if isinstance(page, LTPage):
        for text_box_h in page:
            if isinstance(text_box_h, LTTextBoxHorizontal):
                for text_box_h_l in text_box_h:
                    if isinstance(text_box_h_l, LTTextLineHorizontal):
                        x0, y0, x1, y1 = text_box_h_l.bbox
                        # Flip the y-axis so the origin is the top-left corner
                        y0 = page.height - y0
                        y1 = page.height - y1
    
                        text = text_box_h_l.get_text()
                        # Store the four corners of the line's box
                        boxes.append([[x0, y0], [x1, y0],
                                      [x1, y1], [x0, y1]])
                        texts.append(text)
    
                        print(f'{text}\t({x0}, {y0}, {x1}, {y1})')
    
  • Partial output:
    Malware detection based on directed multi-edge
       (69.517, 77.95079429999998, 542.4866443, 54.04049429999998)
    dataflow graph representation and convolutional
       (71.142, 105.84579429999997, 540.8598435, 81.93549429999996)
    neural network
       (232.883, 133.74179430000004, 379.11839480000003, 109.83149430000003)
    Nguyen Viet Hung
       (105.702, 161.48445089999996, 190.63347499999998, 150.52555089999998)
    Le Quy Don Techincal University
       (81.264, 173.5719019999999, 218.55859060000006, 163.60930199999996)
    Faculty of Information Technology
       (76.013, 185.899902, 216.83435100000005, 175.93730200000005)
    
Visualization

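⚠️ Note: the text coordinates returned by this library are in PDF points (1/72 inch), so they line up with pixel coordinates only when the PDF is rendered to an image at dpi=72. This can be verified with the pdf2image library.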

from pdf2image import convert_from_path
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBoxHorizontal, LTTextLineHorizontal
from PIL import ImageDraw

pdf_path = 'test_files/tiny.pdf'

# Render at dpi=72 so that one PDF point maps to one pixel
images = convert_from_path(pdf_path, dpi=72)

# Draw on the first page only
img = images[0]
draw = ImageDraw.Draw(img)

# maxpages=1 limits extraction to the first page, matching images[0]
for page_layout in extract_pages(pdf_path, maxpages=1):
    height = page_layout.height
    for element in page_layout:
        if isinstance(element, LTTextBoxHorizontal):
            for text_box_h_l in element:
                if isinstance(text_box_h_l, LTTextLineHorizontal):
                    # Note: bbox is returned as (left, bottom, right, top)
                    left, bottom, right, top = text_box_h_l.bbox

                    # bottom and top are measured from the bottom of the page;
                    # subtract them from the page height to get coordinates
                    # with the origin at the top-left corner
                    bottom = height - bottom
                    top = height - top

                    x0, y0 = left, top
                    x1, y1 = right, bottom
                    draw.rectangle([(x0, y0), (x1, y1)], outline=(255, 0, 0))

img.save('res.png')

Visualization result:
(figure: the rendered page with a red rectangle drawn around each extracted text line)
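
Since the coordinates are expressed in points (1/72 inch), rendering the page at a different resolution only requires scaling each value by dpi / 72. The following is a minimal sketch under that assumption; the helper name `points_to_pixels` is hypothetical, the sample box is made up, and the input is assumed to be already converted to a top-left origin.

```python
def points_to_pixels(bbox, dpi):
    """Scale a bbox from PDF points (1/72 inch) to pixels at `dpi`.

    Hypothetical helper. `bbox` is (left, top, right, bottom) in points,
    with the origin already at the top-left corner of the page.
    Returns integer pixel coordinates.
    """
    return tuple(round(v * dpi / 72.0) for v in bbox)


# Made-up box: at 200 dpi every point becomes 200/72 pixels
print(points_to_pixels((36.0, 72.0, 144.0, 90.0), dpi=200))
# → (100, 200, 400, 250)
```

To draw boxes on a higher-resolution render, one could call convert_from_path(pdf_path, dpi=200) and pass each box through this helper before draw.rectangle. Rounding to whole pixels can shift an edge by up to half a pixel.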

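Summary
  • As the results above show, for extracting whole text lines together with their coordinates, pdfminer.six gave better results than pdfplumber in this test, and it is also more flexible.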