(1) Working with PDF files in Python: the pdfplumber library

斋栩

Last modified 2022-04-09 20:37:20

First published 2022-04-09 15:28:43

Original link: https://github.com/jsvine/pdfplumber#options

This article is based on the pdfplumber library on GitHub.

(1) Installation (run in cmd):

pip install pdfplumber

(2) Classes

Top-level class: pdfplumber.PDF

Core class: pdfplumber.Page
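
The relationship between the two classes can be sketched as follows: pdfplumber.open() returns a pdfplumber.PDF, and its .pages attribute is a list of pdfplumber.Page objects. The helper names describe_pages and summarize_pdf are made up for this illustration and are not part of the library; the import is placed inside the function only so that the first helper can be tried without the library installed.

```python
def describe_pages(pdf):
    """Return (page_number, width, height) for each page of a PDF-like object.

    `pdf` is expected to expose a `.pages` list, as pdfplumber.PDF does.
    """
    return [(page.page_number, page.width, page.height) for page in pdf.pages]


def summarize_pdf(path):
    """Open the PDF at `path` with pdfplumber and describe its pages."""
    import pdfplumber  # third-party: pip install pdfplumber

    # The context manager closes the underlying file when the block ends.
    with pdfplumber.open(path) as pdf:
        return describe_pages(pdf)
```

Calling summarize_pdf("example.pdf") (a placeholder file name) would return one tuple per page.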

(3) Main methods and a brief overview

Methods:

.crop(bounding_box, relative=False)
.within_bbox(bounding_box, relative=False)
.dedupe_chars(tolerance=1)
.extract_text(x_tolerance=3, y_tolerance=3, layout=False, x_density=7.25, y_density=13, **kwargs)
.extract_words(x_tolerance=3, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, horizontal_ltr=True, vertical_ttb=True, extra_attrs=[])
.extract_tables(table_settings)
.to_image(**conversion_kwargs)
.close()

Explanation:

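crop: crops the page to the bounding box; objects that fall partly inside are kept and sliced to fit
within_bbox: similar to crop, but keeps only objects that fall entirely within the bounding box
dedupe_chars: returns a version of the page with duplicate characters removed
extract_text: collates the page's character objects into a single string
extract_words: extracts words, returning a list of words together with their bounding boxes
extract_tables: extracts the tables on the page and returns them as a list (each table is a list of rows, each row a list of cells)
to_image: returns an instance of the PageImage class
close: flushes the cache and releases the memory it holds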
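
As a rough illustration of how these methods combine, the sketch below crops the upper half of a page and extracts its text, words, and tables. The helper names top_half_bbox and extract_top_half are hypothetical and chosen only for this example; the bounding box follows pdfplumber's (x0, top, x1, bottom) order and is derived from page.bbox.

```python
def top_half_bbox(page):
    """Bounding box (x0, top, x1, bottom) covering the upper half of `page`."""
    x0, top, x1, bottom = page.bbox
    return (x0, top, x1, top + (bottom - top) / 2)


def extract_top_half(page):
    """Extract text, words and tables from the upper half of a pdfplumber Page."""
    bbox = top_half_bbox(page)
    cropped = page.crop(bbox)        # keeps objects partly inside, sliced to fit
    strict = page.within_bbox(bbox)  # keeps only objects entirely inside
    return {
        "text": cropped.extract_text(),
        # Each word is a dict; "text" holds the word itself.
        "words": [word["text"] for word in strict.extract_words()],
        # Each table is a list of rows, each row a list of cell strings.
        "tables": cropped.extract_tables(),
    }
```

With a real document, page would typically come from pdfplumber.open(path).pages[0]. Because within_bbox drops objects that cross the boundary, the "words" list may omit words on a line that the box edge cuts through.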

(4) Objects

Every instance of pdfplumber.PDF and pdfplumber.Page provides access to several types of PDF objects, all of which are derived from pdfminer.six's parsing of the PDF.

chars
lines
rects
curves
images
annots
hyperlinks

For details on each object type, see the link at the top of this article.
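
A minimal sketch of reading these object lists from a page: each property is a list, and each char is a dict with keys such as "text" and "fontname". The helper names count_objects and fonts_used are assumptions made for this illustration, not library functions.

```python
from collections import Counter

# The object properties listed above.
OBJECT_TYPES = ("chars", "lines", "rects", "curves", "images", "annots", "hyperlinks")


def count_objects(page):
    """Map each object type name to the number of such objects on `page`."""
    return {name: len(getattr(page, name)) for name in OBJECT_TYPES}


def fonts_used(page):
    """Count how many characters on `page` are set in each font."""
    return Counter(char["fontname"] for char in page.chars)
```

On a real page, fonts_used(page).most_common(1) would give the dominant font, which can help when separating body text from headings.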
