A summary of the libraries:
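camelot
: Extracts table data from PDFs.
PyMuPDF
: Reads PDFs; import it with `import fitz`.
pdfplumber
: Extracts information from PDFs. It offers relatively few function interfaces, and its results are not very accurate.
pdfminer
: A PDF extraction tool. As of 2022 it is no longer maintained, so it is not recommended.
pdfminer.six
: Extracts information from PDFs; derived from pdfminer.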
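Because the name you install and the name you import can differ (PyMuPDF is imported as `fitz`), a quick availability check can save some confusion. The sketch below is only an illustration: `IMPORT_NAMES` and `installed_pdf_libs` are hypothetical names, and every mapping other than PyMuPDF to `fitz` is an assumption about how each package is published, not something stated above.

```python
from importlib.util import find_spec

# PyPI distribution name -> top-level import name.
# Only "PyMuPDF -> fitz" comes from the summary above; the other
# entries are assumptions about how each package is published.
IMPORT_NAMES = {
    "camelot-py": "camelot",
    "PyMuPDF": "fitz",
    "pdfplumber": "pdfplumber",
    "pdfminer.six": "pdfminer",
}


def installed_pdf_libs():
    """Return the distribution names whose import name can be located."""
    return [
        dist
        for dist, module in IMPORT_NAMES.items()
        if find_spec(module) is not None  # None means "not importable here"
    ]


if __name__ == "__main__":
    print(installed_pdf_libs())
```

The check only looks up the top-level module, so it does not import the library or verify its version.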
PyMuPDF usage tips
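- Several ways to read a PDF. I could not find an explicit description of the `fitz.open` function in the official documentation, but the source code of the PyPI package offers some hints:
- Source code walkthrough: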

      # lib/python3.7/site-packages/fitz/fitz.py#L3887
      class Document(object):
          thisown = property(lambda x: x.this.own(), lambda x, v: x.this.own(v), doc="The membership flag")
          __repr__ = _swig_repr
          __swig_destroy__ = _fitz.delete_Document

          def __init__(self, filename=None, stream=None, filetype=None, rect=None, width=0, height=0, fontsize=11):
              """Creates a document. Use 'open' as a synonym.

              Notes:
                  Basic usages:
                  open() - new PDF document
                  open(filename) - string, pathlib.Path, or file object.
                  open(filename, fileype=type) - overwrite filename extension.
                  open(type, buffer) - type: extension, buffer: bytes object.
                  open(stream=buffer, filetype=type) - keyword version of previous.
                  Parameters rect, width, height, fontsize: layout reflowable
                  document on open (e.g. EPUB). Ignored if n/a.
              """

- Usage examples (only two are shown; for the rest, see the notes in the docstring above):
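
      import fitz

      pdf_path = '1.pdf'

      # Option 1: open the document directly from its path
      with fitz.open(pdf_path) as doc:
          text = doc[0].get_text()  # text of the first page

      # Option 2: read the raw bytes first, then open from memory
      with open(pdf_path, 'rb') as pdf:
          data = pdf.read()
      # filetype is passed explicitly, matching the docstring's keyword form
      with fitz.open(stream=data, filetype='pdf') as doc:
          text = doc[0].get_text()
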
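The call forms listed in the docstring can also be wrapped in a small dispatcher, so that calling code does not need to care whether it holds a path or raw bytes. This is a minimal sketch, not part of PyMuPDF: `open_args` and `first_page_text` are hypothetical helper names, and defaulting `filetype` to `"pdf"` is an assumption that only makes sense when the input is known to be a PDF.

```python
from pathlib import Path


def open_args(source=None, filetype="pdf"):
    """Translate a source into (args, kwargs) for fitz.open.

    Hypothetical helper; it mirrors the call forms listed in the
    Document docstring and does not import PyMuPDF itself.
    """
    if source is None:
        return (), {}  # open(): new empty PDF
    if isinstance(source, (str, Path)):
        return (str(source),), {}  # open(filename)
    if isinstance(source, (bytes, bytearray)):
        # open(stream=buffer, filetype=type)
        return (), {"stream": bytes(source), "filetype": filetype}
    raise TypeError(f"unsupported source type: {type(source).__name__}")


def first_page_text(source):
    """Return the text of the first page of a PDF given as a path or bytes."""
    if source is None:
        raise ValueError("a path or bytes object is required to read a page")
    import fitz  # imported lazily: only this function needs PyMuPDF

    args, kwargs = open_args(source)
    with fitz.open(*args, **kwargs) as doc:
        return doc[0].get_text()
```

With this in place, `first_page_text('1.pdf')` and `first_page_text(data)` follow the same two paths as the usage examples above.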