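Introduction
- I have recently been working on PDF parsing. If the text of a PDF can be copied directly in a PDF reader, it can also be extracted directly by code.
- After some research, the two most commonly used libraries turned out to be pdfplumber and pdfminer.six.
- The sections below show how to use each library to extract the text lines of a PDF together with their coordinates.
- Download link for the sample PDF file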
Extracting with pdfplumber
- Official repo: jsvine/pdfplumber
- The documentation is the README file in that repository
- Code:
import pdfplumber

pdf_path = 'hung2019.pdf'
with pdfplumber.open(pdf_path) as pdf:
    first_page = pdf.pages[0]
    # keep_blank_chars=True treats blank characters as part of a word
    # rather than as a separator between words
    result = first_page.extract_words(x_tolerance=1, keep_blank_chars=True)
    for value in result:
        print(value['text'])
- Partial output:
Malware detection based on directed multi-edge dataflow graph representation and convolutional neural network
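- As the output shows, even with keep_blank_chars=True the content of each line is still not extracted cleanly. Other parameters can be tuned, such as x_tolerance and y_tolerance. I tried many combinations and none of them worked.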
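One workaround that is sometimes used instead of tuning tolerances is to post-process the word list and merge words whose vertical positions are close. The sketch below is a minimal illustration, not something tested on the sample PDF: `group_words_into_lines` and its `y_tolerance` argument are hypothetical names introduced here (this `y_tolerance` is not pdfplumber's parameter), and the only assumption about pdfplumber is that each word dict carries the `text`, `x0`, `x1`, `top` and `bottom` keys. Multi-column pages would need extra handling, because words at the same height in different columns end up in the same line.

```python
from typing import Dict, List


def group_words_into_lines(words: List[Dict], y_tolerance: float = 3.0) -> List[Dict]:
    """Merge word dicts into line dicts by comparing their `top` coordinates.

    Hypothetical helper: `words` is assumed to look like the output of
    pdfplumber's extract_words() (keys: text, x0, x1, top, bottom).
    """
    lines: List[List[Dict]] = []
    # Visit words top-to-bottom so that words of one line are adjacent
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Same line if the word starts within y_tolerance of the line's first word
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])

    merged = []
    for line in lines:
        # Restore left-to-right reading order inside the line
        line.sort(key=lambda w: w["x0"])
        merged.append({
            "text": " ".join(w["text"] for w in line),
            "x0": min(w["x0"] for w in line),
            "top": min(w["top"] for w in line),
            "x1": max(w["x1"] for w in line),
            "bottom": max(w["bottom"] for w in line),
        })
    return merged
```

With a real page, `words` would be the list returned by `first_page.extract_words()`, and each returned dict holds one line's text plus its bounding box in pdfplumber's top-left coordinate system.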
Extracting with pdfminer.six
Installation
pip install pdfminer.six
pip install pdf2image
Environment versions
pdf2image 1.16.3
pdfminer.six 20220524
pdfplumber 0.7.5
python 3.10.13
Lower-level (more verbose) version
- Code:
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

pdf_path = 'hung2019.pdf'

with open(pdf_path, 'rb') as f:
    # Create a PDF parser for the file
    parser = PDFParser(f)
    # Create a PDF document object that stores the document structure
    document = PDFDocument(parser)
    # Create a PDF resource manager that stores shared resources
    rsrcmgr = PDFResourceManager()
    # Layout analysis parameters (defaults)
    laparams = LAParams()
    # Create a PDF device that aggregates each page into an LTPage
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Process every page
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        # Get the LTPage object for this page
        layout = device.get_result()
        page_height = layout.bbox[3]
        for x in layout:
            if isinstance(x, LTTextBox):
                for v in x:
                    if isinstance(v, LTTextLine):
                        text = v.get_text()
                        x0, y0, x1, y1 = v.bbox
                        # pdfminer measures y from the bottom of the page; subtract
                        # from the page height to get top-left-origin coordinates
                        y0 = page_height - y0
                        y1 = page_height - y1
                        print(f'{text}\t({x0}, {y0}, {x1}, {y1})')
- Partial output:
Malware detection based on directed multi-edge (69.517, 77.95079429999998, 542.4866443, 54.04049429999998)
dataflow graph representation and convolutional (71.142, 105.84579429999997, 540.8598435, 81.93549429999996)
neural network (232.883, 133.74179430000004, 379.11839480000003, 109.83149430000003)
Nguyen Viet Hung (105.702, 161.48445089999996, 190.63347499999998, 150.52555089999998)
Le Quy Don Techincal University (81.264, 173.5719019999999, 218.55859060000006, 163.60930199999996)
Faculty of Information Technology (76.013, 185.899902, 216.83435100000005, 175.93730200000005)
High-level API version
- Code:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTPage, LTTextBoxHorizontal, LTTextLineHorizontal

pdf_path = 'hung2019.pdf'
pages = list(extract_pages(pdf_path))

# Example: take the first page
page = pages[0]
boxes, texts = [], []
if isinstance(page, LTPage):
    for text_box_h in page:
        if isinstance(text_box_h, LTTextBoxHorizontal):
            for text_box_h_l in text_box_h:
                if isinstance(text_box_h_l, LTTextLineHorizontal):
                    x0, y0, x1, y1 = text_box_h_l.bbox
                    # Flip y so that the origin is the top-left corner of the page
                    y0 = page.height - y0
                    y1 = page.height - y1
                    text = text_box_h_l.get_text()
                    # Four corner points of the line's bounding box
                    boxes.append([[x0, y0], [x1, y0], [x1, y1], [x0, y1]])
                    texts.append(text)
                    print(f'{text}\t({x0}, {y0}, {x1}, {y1})')
- Partial output:
Malware detection based on directed multi-edge (69.517, 77.95079429999998, 542.4866443, 54.04049429999998)
dataflow graph representation and convolutional (71.142, 105.84579429999997, 540.8598435, 81.93549429999996)
neural network (232.883, 133.74179430000004, 379.11839480000003, 109.83149430000003)
Nguyen Viet Hung (105.702, 161.48445089999996, 190.63347499999998, 150.52555089999998)
Le Quy Don Techincal University (81.264, 173.5719019999999, 218.55859060000006, 163.60930199999996)
Faculty of Information Technology (76.013, 185.899902, 216.83435100000005, 175.93730200000005)
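Note that in the printed tuples the second value is larger than the fourth: after the flip, y0 is the bottom edge and y1 is the top edge of the line. If a conventional (left, top, right, bottom) order is preferred, the flip can be factored into a small helper. This is a minimal sketch; `to_top_left` is a hypothetical name introduced here, and it assumes an unrotated page whose origin is at (0, 0).

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def to_top_left(bbox: BBox, page_height: float) -> BBox:
    """Convert a pdfminer bbox (x0, y0, x1, y1), measured from the bottom-left
    corner, into (left, top, right, bottom) measured from the top-left corner.

    Hypothetical helper; assumes an unrotated page with its origin at (0, 0).
    """
    x0, y0, x1, y1 = bbox
    # y1 is the upper edge in PDF space, so it becomes the smaller `top` value
    return (x0, page_height - y1, x1, page_height - y0)
```

For example, `to_top_left(text_box_h_l.bbox, page.height)` could replace the two subtraction lines in the loop above.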
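Visualization
⚠️ Note: the coordinates returned by this library are in PDF points (1/72 inch), so they match pixel coordinates only when the PDF is rendered to an image at dpi=72. This can be verified with the pdf2image library.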
from pdf2image import convert_from_path
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBoxHorizontal, LTTextLineHorizontal
from PIL import ImageDraw

pdf_path = 'test_files/tiny.pdf'

# Render at dpi=72 so that one pixel corresponds to one PDF point
image = convert_from_path(pdf_path, dpi=72)
img = image[0]
draw = ImageDraw.Draw(img)

# Only the first page is analyzed, because img is the first page's rendering
for page_layout in extract_pages(pdf_path, maxpages=1):
    height = page_layout.height
    for element in page_layout:
        if isinstance(element, LTTextBoxHorizontal):
            for text_box_h_l in element:
                if isinstance(text_box_h_l, LTTextLineHorizontal):
                    # bbox is returned as (left, bottom, right, top)
                    left, bottom, right, top = text_box_h_l.bbox
                    # bottom and top are measured from the bottom of the page;
                    # subtract them from the page height to get coordinates
                    # with the origin at the top-left corner
                    bottom = height - bottom
                    top = height - top
                    x0, y0 = left, top
                    x1, y1 = right, bottom
                    draw.rectangle([(x0, y0), (x1, y1)], outline=(255, 0, 0))

img.save('res.png')
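If the page is rendered at a resolution other than 72 dpi, the boxes have to be scaled before drawing. Since one PDF point is 1/72 inch, the scale factor is dpi / 72. The following is a minimal sketch; `scale_box` is a hypothetical helper introduced here, and it assumes the box has already been flipped to a top-left origin and that the renderer maps points to pixels with exactly this factor.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def scale_box(box: Box, dpi: float) -> Box:
    """Scale a box given in PDF points to pixel units of an image rendered at `dpi`.

    Hypothetical helper; one PDF point is 1/72 inch, so the factor is dpi / 72.
    """
    scale = dpi / 72.0
    left, top, right, bottom = box
    return (left * scale, top * scale, right * scale, bottom * scale)
```

For an image produced with `convert_from_path(pdf_path, dpi=144)`, for instance, each box would be passed through `scale_box(box, 144)` before calling `draw.rectangle`.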
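Visualization result:
Summary
- Based on the results above, pdfminer.six extracts text lines more reliably than pdfplumber on this sample, and it is also more flexible.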