A Summary of Python Libraries for Extracting PDF Content

Summary
  • camelot: extracts tables from PDFs
  • PyMuPDF: reads PDFs; imported with import fitz
    • Official documentation: pymupdf
    • Note that the open-source license is AGPL 3.0
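    • get_text("blocks") returns coordinates per block only, not per line; per-line coordinates are available from get_text("dict"), which nests lines and spans inside each block (see the docs)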
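  • pdfplumber: extracts information from PDFs; in my experience it offers relatively few functions and its results are not very accurate
  • pdfminer: a PDF extraction tool; as of 2022 it is no longer maintained and is not recommended
  • pdfminer.six: extracts information from PDFs; a community-maintained fork of pdfminer.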
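On the question of per-line coordinates in PyMuPDF: the dictionary returned by get_text("dict") nests lines (and spans) inside each text block, each with its own bbox. The following is a minimal sketch of walking that structure. The function names line_boxes and pdf_line_boxes are hypothetical helpers, not part of PyMuPDF, and the sketch assumes the documented "blocks" → "lines" → "spans" layout.

```python
def line_boxes(page_dict):
    """Collect (bbox, text) for every text line in a PyMuPDF page dict."""
    results = []
    for block in page_dict.get("blocks", []):
        # Image blocks (type 1) carry no "lines" key, so skip them.
        if block.get("type", 0) != 0:
            continue
        for line in block.get("lines", []):
            # A line's text is the concatenation of its spans.
            text = "".join(span["text"] for span in line.get("spans", []))
            results.append((tuple(line["bbox"]), text))
    return results


def pdf_line_boxes(pdf_path, page_number=0):
    """Hypothetical wrapper: open a PDF and return the line boxes of one page."""
    import fitz  # PyMuPDF, imported lazily so line_boxes() works without it

    with fitz.open(pdf_path) as doc:
        return line_boxes(doc[page_number].get_text("dict"))
```

Calling pdf_line_boxes('1.pdf') would return a list of ((x0, y0, x1, y1), text) tuples for the first page. Keeping the dictionary walk separate from file handling makes it easy to check against a hand-built dictionary.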
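PyMuPDF usage tips
  • Ways to read a PDF: browsing the official documentation, I did not find an obvious description of fitz.open, but the source code in the PyPI package sheds some light:
  • Source walkthrough: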
    # lib/python3.7/site-packages/fitz/fitz.py#L3887
    class Document(object):
        thisown = property(lambda x: x.this.own(), lambda x, v: x.this.own(v), doc="The membership flag")
        __repr__ = _swig_repr
        __swig_destroy__ = _fitz.delete_Document
    
        def __init__(self, filename=None, stream=None, filetype=None, rect=None, width=0, height=0, fontsize=11):
    
            """Creates a document. Use 'open' as a synonym.
    
            Notes:
                Basic usages:
                open() - new PDF document
                open(filename) - string, pathlib.Path, or file object.
                open(filename, fileype=type) - overwrite filename extension.
                open(type, buffer) - type: extension, buffer: bytes object.
                open(stream=buffer, filetype=type) - keyword version of previous.
                Parameters rect, width, height, fontsize: layout reflowable
                     document on open (e.g. EPUB). Ignored if n/a.
            """
    
  • Usage examples (only two are shown; for the others, see the docstring in the source above):
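    import fitz
    
    pdf_path = '1.pdf'
    
    # Option 1: open directly from a file path
    with fitz.open(pdf_path) as doc:
        text = doc[0].get_text()
    
    # Option 2: read the file into bytes, then open from memory
    with open(pdf_path, 'rb') as pdf:
        data = pdf.read()
    
    # filetype tells PyMuPDF how to interpret the byte stream
    with fitz.open(stream=data, filetype='pdf') as doc:
        text = doc[0].get_text()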
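The docstring also lists open() with no arguments (a new, empty PDF) and the keyword form open(stream=buffer, filetype=type). The following is a minimal sketch that exercises both in one round trip. It assumes a PyMuPDF release that provides the snake_case methods new_page, insert_text, and tobytes (older releases spelled these differently), and roundtrip_text is a hypothetical helper name.

```python
def roundtrip_text(message="hello PyMuPDF"):
    """Create a PDF in memory, serialize it, reopen it from bytes, return page text."""
    import fitz  # PyMuPDF, imported lazily inside the helper

    doc = fitz.open()                     # no arguments: new empty PDF
    page = doc.new_page()                 # add one blank page
    page.insert_text((72, 72), message)   # write text 1 inch from the top-left
    data = doc.tobytes()                  # serialize the document to bytes
    doc.close()

    # Keyword form from the docstring: open(stream=buffer, filetype=type)
    with fitz.open(stream=data, filetype="pdf") as reopened:
        return reopened[0].get_text()
```

Nothing is written to disk here, which can be handy when PDF bytes arrive from a network response or a database rather than from a file.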
    
