pdfplumber Manual (Chinese Edition)

趣享Eureka

Original link: https://github.com/jsvine/pdfplumber

pdfplumber

If you need to extract tables or other data from PDFs, feel free to leave a comment.

Original website: https://github.com/jsvine/pdfplumber#visual-debugging

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

Works best on machine-generated, rather than scanned, PDFs. Built on pdfminer.six.

Currently tested on Python 3.6, 3.7, and 3.8.

Note: pdfplumber v0.5.22 was the final version to support Python 3.5.

Installation
Command line interface
Python library
Visual debugging
Extracting tables
Extracting form values
Demonstrations
Comparison to other libraries

Installation

pip install pdfplumber

Command line interface

Basic example

curl "https://raw.githubusercontent.com/jsvine/pdfplumber/stable/examples/pdfs/background-checks.pdf" > background-checks.pdf

pdfplumber < background-checks.pdf > background-checks.csv

The output will be a CSV containing info about every character, line, and rectangle in the PDF.

Options

Argument	Description
`--format [format]`	`csv` or `json`. The `json` format returns more information; it includes PDF-level and page-level metadata, plus dictionary-nested attributes.
`--pages [list of pages]`	A space-delimited, `1`-indexed list of pages or hyphenated page ranges. E.g., `1 11-15`, which would return data for pages 1, 11, 12, 13, 14, and 15.
`--types [list of object types to extract]`	Choices are `char`, `rect`, `line`, `curve`, `image`, `annot`. Defaults to all.
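
To make the `--pages` format concrete, here is a minimal sketch of how a space-delimited list of pages and hyphenated ranges could expand into page numbers. The function name `expand_pages` is hypothetical, and this is an illustration of the format rather than pdfplumber's own parser.

```python
def expand_pages(spec):
    """Expand a space-delimited page spec such as "1 11-15" into page numbers.

    Hypothetical helper for illustration; not part of pdfplumber.
    """
    pages = []
    for token in spec.split():
        if "-" in token:
            # A hyphenated range is inclusive on both ends.
            start, end = token.split("-", 1)
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(token))
    return pages


print(expand_pages("1 11-15"))  # [1, 11, 12, 13, 14, 15]
```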

Python library

Basic example

import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])

Loading a PDF

To start working with a PDF, call pdfplumber.open(x), where x can be a:

path to your PDF file
file object, loaded as bytes
file-like object, loaded as bytes

The open method returns an instance of the pdfplumber.PDF class.

To load a password-protected PDF, pass the password keyword argument, e.g., pdfplumber.open("file.pdf", password = "test").

The `pdfplumber.PDF` class

The top-level pdfplumber.PDF class represents a single PDF and has two main properties:

Property	Description
`.metadata`	A dictionary of metadata key/value pairs, drawn from the PDF’s `Info` trailers. Typically includes “CreationDate,” “ModDate,” “Producer,” etc.
`.pages`	A list containing one `pdfplumber.Page` instance per page loaded.

The `pdfplumber.Page` class

The pdfplumber.Page class is at the core of pdfplumber. Most things you’ll do with pdfplumber will revolve around this class. It has these main properties:

Property	Description
`.page_number`	The sequential page number, starting with `1` for the first page, `2` for the second, and so on.
`.width`	The page’s width.
`.height`	The page’s height.
`.objects` / `.chars` / `.lines` / `.rects` / `.curves` / `.figures` / `.images`	Each of these properties is a list, and each list contains one dictionary for each such object embedded on the page. For more detail, see “Objects” below.

… and these main methods:

Method	Description
`.crop(bounding_box, relative=False)`	Returns a version of the page cropped to the bounding box, which should be expressed as a 4-tuple with the values `(x0, top, x1, bottom)`. Cropped pages retain objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the bounding box. If `relative=True`, the bounding box is calculated as an offset from the top-left of the page’s bounding box, rather than an absolute positioning. (See Issue #245 for a visual example and explanation.)
`.within_bbox(bounding_box, relative=False)`	Similar to `.crop`, but only retains objects that fall entirely within the bounding box.
`.filter(test_function)`	Returns a version of the page with only the `.objects` for which `test_function(obj)` returns `True`.
`.extract_text(x_tolerance=3, y_tolerance=3)`	Collates all of the page’s character objects into a single string. Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.
`.extract_words(x_tolerance=3, y_tolerance=3, horizontal_ltr=True, vertical_ttb=True)`	Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for “upright” characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` and where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. The parameters `horizontal_ltr` and `vertical_ttb` indicate whether the words should be read from left-to-right (for horizontal words) / top-to-bottom (for vertical words).
`.extract_tables(table_settings)`	Extracts tabular data from the page. For more details see “Extracting tables” below.
`.to_image(**conversion_kwargs)`	Returns an instance of the `PageImage` class. For more details, see “Visual debugging” below. For conversion_kwargs, see here.
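
The tolerance rules for `.extract_words` can be illustrated with a small sketch that works on plain dictionaries shaped like `char` objects. This is a simplified approximation, not pdfplumber's implementation: it assumes upright characters that arrive in reading order, and the function name `group_words` is hypothetical.

```python
def group_words(chars, x_tolerance=3, y_tolerance=3):
    """Group char dicts into words using the x/y tolerance rules.

    Simplified sketch: assumes upright chars already in reading order.
    """
    words, current = [], []
    for char in chars:
        if char["text"].isspace():
            # A blank character always ends the current word.
            if current:
                words.append(current)
                current = []
            continue
        if current:
            prev = current[-1]
            same_line = abs(char["doctop"] - prev["doctop"]) <= y_tolerance
            adjacent = char["x0"] - prev["x1"] <= x_tolerance
            if not (same_line and adjacent):
                words.append(current)
                current = []
        current.append(char)
    if current:
        words.append(current)
    return ["".join(c["text"] for c in word) for word in words]


sample_chars = [
    {"text": "H", "x0": 0, "x1": 5, "doctop": 10},
    {"text": "i", "x0": 5, "x1": 10, "doctop": 10},
    {"text": "y", "x0": 20, "x1": 25, "doctop": 10},  # gap of 10 > x_tolerance
    {"text": "o", "x0": 25, "x1": 30, "doctop": 10},
    {"text": "o", "x0": 0, "x1": 5, "doctop": 30},    # new line
    {"text": "k", "x0": 5, "x1": 10, "doctop": 30},
]
print(group_words(sample_chars))  # ['Hi', 'yo', 'ok']
```

Raising `x_tolerance` to 10 in this sample would merge "Hi" and "yo" into one word, which is the kind of effect to expect when tuning the real parameter.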

Objects

Each instance of pdfplumber.PDF and pdfplumber.Page provides access to several types of PDF objects. The following properties each return a Python list of the matching objects:

.chars, each representing a single text character.
.lines, each representing a single 1-dimensional line.
.rects, each representing a single 2-dimensional rectangle.
.curves, each representing a series of connected points.
.images, each representing an image.
.figures, each representing a figure.
.annots, each representing a single PDF annotation (cf. Section 8.4 of the official PDF specification for details).
.hyperlinks, each representing a single PDF annotation of the subtype Link and having a URI action attribute.

Each object is represented as a simple Python dict, with the following properties:

`char` properties

Property	Description
`page_number`	Page number on which this character was found.
`text`	E.g., “z”, or “Z” or " ".
`fontname`	Name of the character’s font face.
`size`	Font size.
`adv`	Equal to text width * the font size * scaling factor.
`upright`	Whether the character is upright.
`height`	Height of the character.
`width`	Width of the character.
`x0`	Distance of left side of character from left side of page.
`x1`	Distance of right side of character from left side of page.
`y0`	Distance of bottom of character from bottom of page.
`y1`	Distance of top of character from bottom of page.
`top`	Distance of top of character from top of page.
`bottom`	Distance of bottom of the character from top of page.
`doctop`	Distance of top of character from top of document.
`object_type`	“char”
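
The table mixes two vertical reference points: `y0`/`y1` are measured from the bottom of the page, while `top`/`bottom` are measured from the top. The sketch below shows how the two relate. It assumes the page's coordinate origin is its bottom-left corner and that `doctop` adds the combined height of all preceding pages; the function and parameter names are hypothetical.

```python
def to_top_based(obj, page_height, height_of_previous_pages=0):
    """Derive top/bottom/doctop from bottom-based y0/y1 coordinates.

    Assumes the page origin is the bottom-left corner (an assumption
    for this sketch; real pages may have an offset bounding box).
    """
    top = page_height - obj["y1"]
    bottom = page_height - obj["y0"]
    return {
        "top": top,
        "bottom": bottom,
        "doctop": height_of_previous_pages + top,
    }


# A 12-point-tall character near the top of the second US Letter page.
char = {"y0": 700, "y1": 712}
print(to_top_based(char, page_height=792, height_of_previous_pages=792))
```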

`line` properties

Property	Description
`page_number`	Page number on which this line was found.
`height`	Height of line.
`width`	Width of line.
`x0`	Distance of left-side extremity from left side of page.
`x1`	Distance of right-side extremity from left side of page.
`y0`	Distance of bottom extremity from bottom of page.
`y1`	Distance of top extremity from bottom of page.
`top`	Distance of top of line from top of page.
`bottom`	Distance of bottom of the line from top of page.
`doctop`	Distance of top of line from top of document.
`linewidth`	Thickness of line.
`object_type`	“line”

`rect` properties

Property	Description
`page_number`	Page number on which this rectangle was found.
`height`	Height of rectangle.
`width`	Width of rectangle.
`x0`	Distance of left side of rectangle from left side of page.
`x1`	Distance of right side of rectangle from left side of page.
`y0`	Distance of bottom of rectangle from bottom of page.
`y1`	Distance of top of rectangle from bottom of page.
`top`	Distance of top of rectangle from top of page.
`bottom`	Distance of bottom of the rectangle from top of page.
`doctop`	Distance of top of rectangle from top of document.
`linewidth`	Thickness of line.
`object_type`	“rect”

`curve` properties

Property	Description
`page_number`	Page number on which this curve was found.
`points`	Points — as a list of `(x, top)` tuples — describing the curve.
`height`	Height of curve’s bounding box.
`width`	Width of curve’s bounding box.
`x0`	Distance of curve’s left-most point from left side of page.
`x1`	Distance of curve’s right-most point from left side of the page.
`y0`	Distance of curve’s lowest point from bottom of page.
`y1`	Distance of curve’s highest point from bottom of page.
`top`	Distance of curve’s highest point from top of page.
`bottom`	Distance of curve’s lowest point from top of page.
`doctop`	Distance of curve’s highest point from top of document.
`linewidth`	Thickness of line.
`object_type`	“curve”

Additionally, both pdfplumber.PDF and pdfplumber.Page provide access to two derived lists of objects: .rect_edges (which decomposes each rectangle into its four lines) and .edges (which combines .rect_edges with .lines).
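
As a rough picture of what "decomposes each rectangle into its four lines" means, the following sketch splits a `rect`-shaped dictionary into four edge dictionaries and then combines them with existing lines. The helper names and the `orientation` key are illustrative assumptions, not a description of pdfplumber's internals.

```python
def rect_to_edges(rect):
    """Split a rect dict into four edge dicts: top, bottom, left, right."""
    x0, x1, top, bottom = rect["x0"], rect["x1"], rect["top"], rect["bottom"]
    return [
        {"x0": x0, "x1": x1, "top": top, "bottom": top, "orientation": "h"},
        {"x0": x0, "x1": x1, "top": bottom, "bottom": bottom, "orientation": "h"},
        {"x0": x0, "x1": x0, "top": top, "bottom": bottom, "orientation": "v"},
        {"x0": x1, "x1": x1, "top": top, "bottom": bottom, "orientation": "v"},
    ]


def all_edges(rects, lines):
    """Combine the edges derived from rects with the line objects themselves."""
    edges = [edge for rect in rects for edge in rect_to_edges(rect)]
    return edges + list(lines)


box = {"x0": 10, "x1": 110, "top": 20, "bottom": 70}
rule = {"x0": 0, "x1": 200, "top": 5, "bottom": 5, "orientation": "h"}
print(len(all_edges([box], [rule])))  # 5
```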

`image` properties

[To be completed.]

`figure` properties

[To be completed.]

Visual debugging

Note: To use pdfplumber's visual-debugging tools, you’ll also need to have two additional pieces of software installed on your computer:

ImageMagick. Installation instructions here.
ghostscript. Installation instructions here, or simply apt install ghostscript (Ubuntu) / brew install ghostscript (Mac).

Creating a `PageImage` with `.to_image()`

To turn any page (including cropped pages) into a PageImage object, call my_page.to_image(). You can optionally pass a resolution={integer} keyword argument, which defaults to 72. E.g.:

im = my_pdf.pages[0].to_image(resolution=150)

PageImage objects play nicely with IPython/Jupyter notebooks; they automatically render as cell outputs. For example:

[Screenshot: Visual debugging in Jupyter]

Basic `PageImage` methods

Method	Description
`im.reset()`	Clears anything you’ve drawn so far.
`im.copy()`	Copies the image to a new `PageImage` object.
`im.save(path_or_fileobject, format="PNG")`	Saves the annotated image.

Drawing methods

You can pass explicit coordinates or any pdfplumber PDF object (e.g., char, line, rect) to these methods.

Single-object method	Bulk method	Description
`im.draw_line(line, stroke={color}, stroke_width=1)`	`im.draw_lines(list_of_lines, **kwargs)`	Draws a line from a `line`, `curve`, or a 2-tuple of 2-tuples (e.g., `((x, y), (x, y))`).
`im.draw_vline(location, stroke={color}, stroke_width=1)`	`im.draw_vlines(list_of_locations, **kwargs)`	Draws a vertical line at the x-coordinate indicated by `location`.
`im.draw_hline(location, stroke={color}, stroke_width=1)`	`im.draw_hlines(list_of_locations, **kwargs)`	Draws a horizontal line at the y-coordinate indicated by `location`.
`im.draw_rect(bbox_or_obj, fill={color}, stroke={color}, stroke_width=1)`	`im.draw_rects(list_of_rects, **kwargs)`	Draws a rectangle from a `rect`, `char`, etc., or 4-tuple bounding box.
`im.draw_circle(center_or_obj, radius=5, fill={color}, stroke={color})`	`im.draw_circles(list_of_circles, **kwargs)`	Draws a circle at `(x, y)` coordinate or at the center of a `char`, `rect`, etc.

Note: The methods above are built on Pillow’s ImageDraw methods, but the parameters have been tweaked for consistency with SVG’s fill/stroke/stroke_width nomenclature.
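
Putting the pieces above together, a short sketch of a typical debugging pass might look like the following. It uses only methods listed in this document (`.to_image`, `.extract_words`, `draw_rects`, `save`); the function name `save_word_boxes` and the file paths are placeholders, and running it requires pdfplumber plus the additional software mentioned in the note at the start of this section.

```python
def save_word_boxes(pdf_path, out_path, resolution=150):
    """Render the first page, outline each word, and save the result as PNG.

    Hypothetical helper; pdfplumber is imported lazily so that the
    function can be defined even where the library is not installed.
    """
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        im = page.to_image(resolution=resolution)
        # Word dicts carry x0/top/x1/bottom, so they can be drawn as rects.
        im.draw_rects(page.extract_words())
        im.save(out_path, format="PNG")
    return out_path
```

A call such as `save_word_boxes("background-checks.pdf", "words.png")` would write an annotated image that shows how the page is being split into words.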

Extracting tables

pdfplumber's approach to table detection borrows heavily from Anssi Nurminen’s master’s thesis, and is inspired by Tabula. It works like this:

For any given PDF page, find the lines that are (a) explicitly defined and/or (b) implied by the alignment of words on the page.
Merge overlapping, or nearly-overlapping, lines.
Find the intersections of all those lines.
Find the most granular set of rectangles (i.e., cells) that use these intersections as their vertices.
Group contiguous cells into tables.
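
Two of the steps above, merging nearly-overlapping lines and finding intersections, can be sketched on simplified data. Here a horizontal edge is a dict with `x0`, `x1`, `y`, and a vertical edge is a dict with `x`, `top`, `bottom`; these shapes and the function names are assumptions made for the illustration, and the real table finder is more involved.

```python
def snap(values, tolerance=3):
    """Map each coordinate to the mean of its cluster of nearby coordinates."""
    clusters = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= tolerance:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return {v: sum(c) / len(c) for c in clusters for v in c}


def intersections(h_edges, v_edges, tolerance=3):
    """Return sorted (x, y) points where a horizontal and a vertical edge meet."""
    points = []
    for h in h_edges:
        for v in v_edges:
            crosses_x = h["x0"] - tolerance <= v["x"] <= h["x1"] + tolerance
            crosses_y = v["top"] - tolerance <= h["y"] <= v["bottom"] + tolerance
            if crosses_x and crosses_y:
                points.append((v["x"], h["y"]))
    return sorted(points)


h_edges = [{"x0": 0, "x1": 100, "y": 0}, {"x0": 0, "x1": 100, "y": 50}]
v_edges = [{"x": 0, "top": 0, "bottom": 50}, {"x": 100, "top": 0, "bottom": 50}]
print(snap([100, 101, 102, 200]))
print(intersections(h_edges, v_edges))
```

The four intersection points found for this sample are the corners of a single cell, which is the input the "most granular set of rectangles" step would then work from.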

Table-extraction methods

pdfplumber.Page objects can call the following table methods:

Method	Description
`.find_tables(table_settings={})`	Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.
`.extract_tables(table_settings={})`	Returns the text extracted from all tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`.
`.extract_table(table_settings={})`	Returns the text extracted from the largest table on the page, represented as a list of lists, with the structure `row -> cell`. (If multiple tables have the same size, as measured by the number of cells, this method returns the table closest to the top of the page.)
`.debug_tablefinder(table_settings={})`	Returns an instance of the `TableFinder` class, with access to the `.edges`, `.intersections`, `.cells`, and `.tables` properties.
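
The tie-breaking rule for `.extract_table` (most cells first, then closest to the top of the page) can be expressed in one comparison. The sketch below uses plain dicts with a `cells` list and a `top` coordinate as stand-ins for table objects; that shape and the function name are hypothetical.

```python
def pick_largest_table(tables):
    """Pick the table with the most cells; ties go to the one nearest the top."""
    if not tables:
        return None
    # A smaller `top` means closer to the top of the page, hence the negation.
    return max(tables, key=lambda t: (len(t["cells"]), -t["top"]))


lower = {"name": "lower", "cells": [1, 2, 3, 4], "top": 300}
upper = {"name": "upper", "cells": [1, 2, 3, 4], "top": 100}
small = {"name": "small", "cells": [1], "top": 0}
print(pick_largest_table([lower, upper, small])["name"])  # upper
```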

For example:

pdf = pdfplumber.open("path/to/my.pdf")
page = pdf.pages[0]
page.extract_table()

Click here for a more detailed example.

Table-extraction settings

By default, extract_tables uses the page’s vertical and horizontal lines (or rectangle edges) as cell-separators. But the method is highly customizable via the table_settings argument. The possible settings, and their defaults:

{
    "vertical_strategy": "lines", 
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
}

Setting	Description
`"vertical_strategy"`	Either `"lines"`, `"lines_strict"`, `"text"`, or `"explicit"`. See explanation below.
`"horizontal_strategy"`	Either `"lines"`, `"lines_strict"`, `"text"`, or `"explicit"`. See explanation below.
`"explicit_vertical_lines"`	A list of vertical lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers (indicating the `x` coordinate of a line the full height of the page) or `line`/`rect`/`curve` objects.
`"explicit_horizontal_lines"`	A list of horizontal lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers (indicating the `y` coordinate of a line the full width of the page) or `line`/`rect`/`curve` objects.
`"snap_tolerance"`	Parallel lines within `snap_tolerance` pixels will be “snapped” to the same horizontal or vertical position.
`"join_tolerance"`	Line segments on the same infinite line, and whose ends are within `join_tolerance` of one another, will be “joined” into a single line segment.
`"edge_min_length"`	Edges shorter than `edge_min_length` will be discarded before attempting to reconstruct the table.
`"min_words_vertical"`	When using `"vertical_strategy": "text"`, at least `min_words_vertical` words must share the same alignment.
`"min_words_horizontal"`	When using `"horizontal_strategy": "text"`, at least `min_words_horizontal` words must share the same alignment.
`"keep_blank_chars"`	When using the `text` strategy, consider `" "` chars to be parts of words and not word-separators.
`"text_tolerance"`, `"text_x_tolerance"`, `"text_y_tolerance"`	When the `text` strategy searches for words, it will expect the individual letters in each word to be no more than `text_tolerance` pixels apart.
`"intersection_tolerance"`, `"intersection_x_tolerance"`, `"intersection_y_tolerance"`	When combining edges into cells, orthogonal edges must be within `intersection_tolerance` pixels to be considered intersecting.
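
The defaults list the `_x_` and `_y_` tolerances as `None` next to a generic value of 3. One plausible reading, assumed here rather than stated in the table, is that an axis-specific tolerance left as `None` inherits the generic one. The sketch below resolves a partial settings dict under that assumption; `resolve_tolerances` is a hypothetical helper, not a pdfplumber function.

```python
DEFAULTS = {
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
}


def resolve_tolerances(table_settings=None):
    """Fill in defaults, then let each None x/y tolerance inherit the generic one.

    The inheritance rule is an assumption made for this sketch.
    """
    settings = {**DEFAULTS, **(table_settings or {})}
    for prefix in ("text", "intersection"):
        for axis in ("x", "y"):
            key = f"{prefix}_{axis}_tolerance"
            if settings[key] is None:
                settings[key] = settings[f"{prefix}_tolerance"]
    return settings


resolved = resolve_tolerances({"text_tolerance": 5, "text_y_tolerance": 1})
print(resolved["text_x_tolerance"], resolved["text_y_tolerance"])  # 5 1
```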

Table-extraction strategies

Both vertical_strategy and horizontal_strategy accept the following options:

Strategy	Description
`"lines"`	Use the page’s graphical lines, including the sides of rectangle objects, as the borders of potential table-cells.
`"lines_strict"`	Use the page’s graphical lines, but not the sides of rectangle objects, as the borders of potential table-cells.
`"text"`	For `vertical_strategy`: Deduce the (imaginary) lines that connect the left, right, or center of words on the page, and use those lines as the borders of potential table-cells. For `horizontal_strategy`, the same but using the tops of words.
`"explicit"`	Only use the lines explicitly defined in `explicit_vertical_lines` / `explicit_horizontal_lines`.
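
To give a feel for the `"text"` strategy, the sketch below looks for x positions shared by at least `min_words_vertical` words, checking left edges, right edges, and centers separately. It is a deliberately crude approximation: positions are matched after rounding to whole units rather than clustered with a tolerance, and `implied_vertical_lines` is a hypothetical name.

```python
from collections import Counter


def implied_vertical_lines(words, min_words_vertical=3):
    """Return x positions where enough words share a left, right, or center."""
    getters = (
        lambda w: w["x0"],                  # left edge
        lambda w: w["x1"],                  # right edge
        lambda w: (w["x0"] + w["x1"]) / 2,  # center
    )
    found = set()
    for get_x in getters:
        # Count each alignment type on its own, so a left edge never
        # combines with another word's right edge.
        counts = Counter(round(get_x(w)) for w in words)
        found.update(x for x, n in counts.items() if n >= min_words_vertical)
    return sorted(found)


# Three left-aligned words of different widths in one column.
column = [
    {"x0": 10, "x1": 30},
    {"x0": 10, "x1": 40},
    {"x0": 10, "x1": 51},
]
print(implied_vertical_lines(column))  # [10]
```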

Notes

Often it’s helpful to crop a page — Page.crop(bounding_box) — before trying to extract the table.
Table extraction for pdfplumber was radically redesigned for v0.5.0, and introduced breaking changes.
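
Cropping before extraction changes which objects the table finder sees. As a sketch of the selection rules described for `.crop` and `.within_bbox` in the Page methods table, the helpers below operate on plain dicts with `x0`, `x1`, `top`, `bottom`. The function names are hypothetical and the code illustrates the rules rather than reproducing pdfplumber's implementation.

```python
def overlaps(obj, bbox):
    """True if the object falls at least partly within the bounding box."""
    x0, top, x1, bottom = bbox
    return (obj["x0"] < x1 and obj["x1"] > x0
            and obj["top"] < bottom and obj["bottom"] > top)


def inside(obj, bbox):
    """True if the object falls entirely within the bounding box."""
    x0, top, x1, bottom = bbox
    return (obj["x0"] >= x0 and obj["x1"] <= x1
            and obj["top"] >= top and obj["bottom"] <= bottom)


def clip(obj, bbox):
    """Slice an overlapping object's coordinates to fit the bounding box."""
    x0, top, x1, bottom = bbox
    return {**obj,
            "x0": max(obj["x0"], x0), "x1": min(obj["x1"], x1),
            "top": max(obj["top"], top), "bottom": min(obj["bottom"], bottom)}


bbox = (0, 0, 100, 100)
objs = [
    {"x0": 10, "x1": 20, "top": 10, "bottom": 20},    # fully inside
    {"x0": 90, "x1": 120, "top": 10, "bottom": 20},   # straddles the right edge
    {"x0": 200, "x1": 220, "top": 10, "bottom": 20},  # fully outside
]
crop_like = [clip(o, bbox) for o in objs if overlaps(o, bbox)]
within_like = [o for o in objs if inside(o, bbox)]
print(len(crop_like), len(within_like))  # 2 1
```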

Extracting form values

Sometimes PDF files can contain forms that include inputs that people can fill out and save. While values in form fields appear like other text in a PDF file, form data is handled differently. If you want the gory details, see page 671 of this specification.

pdfplumber doesn’t have an interface for working with form data, but you can access it using pdfplumber's wrappers around pdfminer.

For example, this snippet will retrieve form field names and values and store them in a dictionary. You may have to modify this script to handle cases like nested fields (see page 676 of the specification).

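import pdfplumber

# Open the PDF and look up the form's field list through pdfminer's objects.
pdf = pdfplumber.open("document_with_form.pdf")
fields = pdf.doc.catalog["AcroForm"].resolve()["Fields"]

form_data = {}
for field in fields:
    resolved = field.resolve()
    field_name = resolved["T"]
    # "V" is missing when a field has no value, so avoid a KeyError.
    field_value = resolved.get("V")
    form_data[field_name] = field_value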
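
For the nested-field case mentioned above, one possible approach is to walk the field tree recursively. In the PDF specification, child fields are listed under a parent's `Kids` entry and a fully qualified field name joins the partial names with periods. The sketch below follows that convention; `collect_fields` and `resolve` are hypothetical helpers, and the sample input uses plain dicts in place of the objects pdfminer would return.

```python
def resolve(obj):
    """Resolve an indirect reference if the object offers resolve()."""
    return obj.resolve() if hasattr(obj, "resolve") else obj


def collect_fields(fields, prefix="", inherited=None):
    """Walk a field tree and return {fully qualified name: value}."""
    form_data = {}
    for field in fields:
        field = resolve(field)
        name = field.get("T")
        if isinstance(name, bytes):
            name = name.decode("utf-8", errors="replace")
        full_name = ".".join(part for part in (prefix, name) if part)
        # A child without its own "V" falls back to the parent's value.
        value = resolve(field.get("V", inherited))
        kids = resolve(field.get("Kids"))
        if kids:
            form_data.update(collect_fields(kids, full_name, value))
        else:
            form_data[full_name] = value
    return form_data


sample_fields = [
    {"T": "name", "V": "Ada"},
    {"T": "address", "Kids": [
        {"T": "city", "V": "London"},
        {"T": "zip", "V": "N1"},
    ]},
]
print(collect_fields(sample_fields))
```

With a real document, the `fields` list from the snippet above could be passed in place of `sample_fields`; string values may come back as bytes and need decoding.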

Demonstrations

Using extract_table on a California Worker Adjustment and Retraining Notification (WARN) report. Demonstrates basic visual debugging and table extraction.
Using extract_table on the FBI’s National Instant Criminal Background Check System PDFs. Demonstrates how to use visual debugging to find optimal table extraction settings. Also demonstrates Page.crop(...) and Page.extract_text(...).
Inspecting and visualizing curve objects.
Extracting fixed-width data from a San Jose PD firearm search report, an example of using Page.extract_text(...).

Comparison to other libraries

Several other Python libraries help users to extract information from PDFs. As a broad overview, pdfplumber distinguishes itself from other PDF processing libraries by combining these features:

Easy access to detailed information about each PDF object
Higher-level, customizable methods for extracting text and tables
Tightly integrated visual debugging
Other useful utility functions, such as filtering objects via a crop-box

It’s also helpful to know what features pdfplumber does not provide:

PDF generation
PDF modification
Optical character recognition (OCR)
Strong support for extracting tables from OCR’ed documents

Specific comparisons

pdfminer.six provides the foundation for pdfplumber. It primarily focuses on parsing PDFs, analyzing PDF layouts and object positioning, and extracting text. It does not provide tools for table extraction or visual debugging.
pymupdf is substantially faster than pdfminer.six (and thus also pdfplumber) and can generate and modify PDFs, but the library requires installation of non-Python software (MuPDF). It also does not enable easy access to shape objects (rectangles, lines, etc.), and does not provide table-extraction or visual debugging tools.
camelot, tabula-py, and pdftables all focus primarily on extracting tables. In some cases, they may be better suited to the particular tables you are trying to extract.
PyPDF2 and its successor libraries appear no longer to be maintained.
