Development of a structure-aware PDF parser


weixin_26705651


Original article: https://medium.com/@_chriz_/development-of-a-structure-aware-pdf-parser-7285f3fe41a9


This article briefly introduces pdfstructure, a PDF parsing library that I am currently developing. The library attempts to capture the original document hierarchy and to make the relations between chapters, headers and paragraphs accessible in a generic way.

Why I started to develop a PDF parsing library

When I work on customer projects that involve document parsing and textual data retrieval, I have to work through the same questions again and again:

In what format is the textual data presented?
Is it structured data accessible through a database?
Or is it just a zipped excerpt of a folder structure containing unstructured documents of various types, like office documents and HTML files?

I have often been confronted with the latter scenario, where customer-specific documents had to be parsed for a particular use case.

Of course, libraries that cover text extraction already exist, but in my experience popular libraries like textract or PyPDF2 focus on extracting just the raw text.

Why is that a problem?

Valuable information about the original document hierarchy, which captures the relations between paragraphs, chapter headers and so on, is lost after parsing. But exactly that textual structure can add significant value to use cases where it's important to represent and process data systematically.

A project's results can suffer when unstructured data, such as text documents of different types, sizes and layouts, is modeled and applied in the same way to solve a business problem. One example is ingesting all kinds of documents directly into a search index as they are, without analyzing the underlying data first.

Figure 1 shows an example with

Document A as a user handbook that covers a range of topics in chapters and paragraphs in great detail
Document B as a single-page document with little text

Instead of adding the raw text of those documents directly to the search index, document A could be split up into its top-level chapters, each keeping a link to the original document. Additionally, the chapter title can be added to the search index as an analysed field. That could then boost such a sub-document so that it is identified as the correct search hit for a given query.
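As a rough illustration of this idea, the sketch below turns a parsed document into one index record per top-level chapter. The dictionary layout (`title`, `text`, `children`), the record fields and the function names are assumptions made for this example; they are not part of pdfstructure or of any particular search engine.

```python
from typing import Dict, List


def collect_text(section: Dict) -> str:
    """Concatenate the text of a section and all of its sub-sections."""
    parts = [section.get("text", "")]
    for child in section.get("children", []):
        parts.append(child["title"])
        parts.append(collect_text(child))
    return "\n".join(p for p in parts if p)


def chapters_to_index_records(doc_id: str, chapters: List[Dict]) -> List[Dict]:
    """Create one search record per top-level chapter (hypothetical schema)."""
    return [
        {
            "parent_document": doc_id,          # link back to the original document
            "chapter_title": chapter["title"],  # candidate for an analysed field
            "body": collect_text(chapter),
        }
        for chapter in chapters
    ]


handbook = [
    {"title": "Installation", "text": "Unpack the device.", "children": [
        {"title": "Power supply", "text": "Connect the cable.", "children": []},
    ]},
    {"title": "Maintenance", "text": "Clean the filter monthly.", "children": []},
]
records = chapters_to_index_records("document_a.pdf", handbook)
for record in records:
    print(record["chapter_title"], "->", len(record["body"]), "characters")
```

Each record stays small enough to be a precise search hit, while `parent_document` still lets a result point back to the full handbook.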

Introduction to automated textual structure parsing

I have started to develop the library pdfstructure to tackle the problem of parsing a document's structure in a generic way, independent of its layout.

How does it work?

pdfstructure is built on top of pdfminer.six, which provides:

Text extraction directly from the PDF's source code
Exposure of the exact location, font and color of the extracted text
Layout analysis to group text into lines and paragraphs

Adding hierarchy

pdfstructure adds a processing step on top of the extracted flat paragraph list and creates a nested tree structure that should represent the original hierarchy.


At a high level, the algorithm works as follows:

1. Analyze the distribution of character style features, like font size and font name, that occur in a given document.
   Often a large number of different font sizes are used within a single document. To make life easier, font sizes are mapped to predefined sizes like small, medium or large.
2. Iterate through the paragraphs and annotate each of them with its predominant style, like the mapped text size and character weight (bold).
3. Categorize each paragraph as header or content.
4. By leveraging the paragraph's category, the document structure can be recreated as a general tree structure in one pass, where
   - smaller headers are treated as a sub-section of larger headers (parent)
   - content paragraphs are children of a header paragraph
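The analysis, annotation and categorization steps above can be sketched roughly as follows. This is a simplified illustration, not pdfstructure's implementation: the `Paragraph` class, the assumption that the most frequent font size is the body text size ("medium"), and the header rule (larger than body text, or bold) are all choices made for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Paragraph:
    text: str
    font_size: float       # predominant font size of the paragraph
    bold: bool = False
    mapped_size: str = ""  # filled in by annotate()
    category: str = ""     # "header" or "content", filled in by annotate()


def map_size(size: float, body_size: float) -> str:
    """Map a raw font size to a coarse bucket relative to the body text size."""
    if size < body_size:
        return "small"
    if size > body_size:
        return "large"
    return "medium"


def annotate(paragraphs: List[Paragraph]) -> None:
    # Analyse the size distribution, weighted by text length; the most
    # common size is assumed to be the body text size.
    sizes = Counter()
    for p in paragraphs:
        sizes[p.font_size] += len(p.text)
    body_size = sizes.most_common(1)[0][0]

    for p in paragraphs:
        # Annotate each paragraph with its mapped size.
        p.mapped_size = map_size(p.font_size, body_size)
        # Categorise (assumed rule: large or bold text is a header).
        is_header = p.mapped_size == "large" or p.bold
        p.category = "header" if is_header else "content"


paragraphs = [
    Paragraph("Search Basics", 18.0),
    Paragraph("Breadth First Search", 14.0, bold=True),
    Paragraph("An algorithm that searches a tree level by level.", 10.0),
    Paragraph("Page 1", 8.0),
]
annotate(paragraphs)
for p in paragraphs:
    print(f"{p.category:8} {p.mapped_size:6} {p.text}")
```

Real documents would likely need finer buckets than three sizes, but a relative mapping like this keeps the later steps independent of the absolute font sizes a document happens to use.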

For humans it's an easy task to group paragraphs accordingly, based on visual cues like boldness or text size.
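For a program, the same grouping can be approximated in one pass with a stack of currently open headers. The sketch below is an illustration under assumptions, not the library's code: `Node`, `build_tree` and the numeric size used to compare headers are hypothetical, and the input paragraphs are assumed to be already categorized as header or content.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    text: str
    level: int = 0
    children: List["Node"] = field(default_factory=list)


def build_tree(paragraphs: List[Tuple[str, str, float]]) -> List[Node]:
    """Build a tree from (category, text, size) tuples given in reading order."""
    roots: List[Node] = []
    stack: List[Tuple[float, Node]] = []  # open headers, largest at the bottom

    for category, text, size in paragraphs:
        if category == "header":
            # Close headers that are not larger than the new one, so that
            # a smaller header becomes a sub-section of a larger one.
            while stack and stack[-1][0] <= size:
                stack.pop()
        node = Node(text, level=len(stack))
        if stack:
            stack[-1][1].children.append(node)  # child of the innermost header
        else:
            roots.append(node)
        if category == "header":
            stack.append((size, node))
    return roots


tree = build_tree([
    ("header", "Search Basics", 18.0),
    ("header", "Breadth First Search", 14.0),
    ("content", "Searches a tree level by level.", 10.0),
    ("header", "Depth First Search", 14.0),
    ("content", "Searches a tree branch by branch.", 10.0),
    ("header", "Sorting", 18.0),
])
print([root.text for root in tree])
print([child.text for child in tree[0].children])
```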

# 1) An easy example: GitHub page as PDF

The following example document uses distinctive style features to define the document's structure:

[Image: TSiege / The Technical Interview Cheat Sheet.md]

Figure 3 showcases a subset of the parsed tree structure for the prior document.


# 2) A somewhat harder example: book parsing

Books can be harder to parse since they usually consist of many chapters that include specific layout features like headers, footers or text boxes that highlight a specific paragraph.

The following image (Figure 4) shows a brief side-by-side comparison of the parsed document and the original PDF "Kafka: The Definitive Guide".

Left image: PyCharm debugging into the document model
Right image: the book rendered in a PDF viewer

Note: Textual structure parsing is based purely on text style analysis; any additional information, like interactive links, is not used.

Document Model

class StructuredDocument:
  metadata: dict
  sections: List[Section]

class Section:
  content:  TextElement
  children: List[Section]
  level:    int

class TextElement:
  text:     LTTextContainer # the extracted paragraph from pdfminer
  style:    Style

Usage

The project is still in early development, but it is already able to handle and represent various kinds of documents pretty well.


Text extraction

from pdfstructure.hierarchy.parser import HierarchyParser
from pdfstructure.source import FileSource

parser = HierarchyParser()

# specify source (that implements source.read())
# `path` is the location of the PDF to parse (placeholder, set it to your own file)
source = FileSource(path)

# analyse document and parse as nested data structure
document = parser.parse_pdf(source)

Export

The extracted text is stored as a tree and can be serialized to JSON, or for debugging purposes simply printed in a pretty string format.


from pdfstructure.printer import PrettyStringPrinter

pretty_string_printer = PrettyStringPrinter()
pretty_string = pretty_string_printer.print(document)
print(pretty_string)
"
[Search Basics]
 [Breadth First Search]
  [Definition:]
   An algorithm that searches a tree (or graph) by searching levels
   of the tree first, starting at the root.
   It finds every node on the same level, most often moving left to 
   right.
  [What you need to know:]
   Optimal for searching a tree that is wider than it is deep.
   Uses a queue to store information about the tree while it    
   traverses a tree.
  [Time Complexity:]
   Search: Breadth First Search: O(V + E)
   E is number of edges
   V is number of vertices
"

The JsonFilePrinter implementation can be used to serialize the document to a file (parsed example can be found here).
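To give a feel for why a tree of sections maps naturally to JSON, here is a minimal round-trip sketch using only the standard library. It does not use or reproduce JsonFilePrinter; the `Node` class and `node_from_dict` helper are hypothetical stand-ins for the library's model.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Node:
    text: str
    level: int
    children: List["Node"] = field(default_factory=list)


def node_from_dict(data: Dict) -> Node:
    """Rebuild a Node (and its sub-tree) from its dictionary form."""
    children = [node_from_dict(child) for child in data["children"]]
    return Node(data["text"], data["level"], children)


root = Node("Search Basics", 0, [
    Node("Breadth First Search", 1, [Node("Definition:", 2)]),
])

# asdict() converts nested dataclasses recursively, so the tree maps to JSON.
json_string = json.dumps(asdict(root), indent=2)
restored = node_from_dict(json.loads(json_string))
print(restored == root)
```

Because every node only holds plain values and a list of child nodes, serializing and restoring the hierarchy needs no more than one recursive helper.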

The document can of course be easily loaded from file whenever needed.


import json

from pdfstructure.model.document import StructuredPdfDocument

# `file` is an open handle to the JSON file written by JsonFilePrinter
json_string = json.load(file)
document = StructuredPdfDocument.from_json(json_string)
print(document.title)
"interview_cheatsheet.pdf"

Leveraging textual structure

With all paragraphs and sections organised, it's straightforward to iterate through the layers and search for specific elements like headlines, or to extract all main headers like chapter titles.

A parsed document can be traversed using the in-order or level-order generator implementations.

from pdfstructure.hierarchy.traversal import 
sections = [e for e in

The sections can then be used however necessary. The previously parsed document could then yield sections as shown below:

     "Search"      "Sorting"
      /    \        /      \
   "BFS"  "DFS"  "Merge"  "Quick"
   / | \    |     / | \      |
  d  w  t       d  w  t

## yield order ###
["Search", "Sorting", "BFS", "DFS", "Quick", "Merge"]
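The two traversal styles can be illustrated with plain Python generators. This is a generic sketch, not the code in pdfstructure's traversal module: `TreeNode` and both function names are hypothetical, the depth-first variant is only one possible reading of "in-order" for a tree with arbitrarily many children, and siblings are yielded from left to right.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class TreeNode:
    title: str
    children: List["TreeNode"] = field(default_factory=list)


def traverse_level_order(roots: List[TreeNode]) -> Iterator[TreeNode]:
    """Yield nodes level by level, siblings from left to right."""
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(node.children)


def traverse_depth_first(roots: List[TreeNode]) -> Iterator[TreeNode]:
    """Yield each node before its children (depth-first, parent first)."""
    for node in roots:
        yield node
        yield from traverse_depth_first(node.children)


forest = [
    TreeNode("Search", [TreeNode("BFS"), TreeNode("DFS")]),
    TreeNode("Sorting", [TreeNode("Merge"), TreeNode("Quick")]),
]
print([n.title for n in traverse_level_order(forest)])
print([n.title for n in traverse_depth_first(forest)])
```

Level-order is handy for collecting all main headers first, while the depth-first variant follows the reading order of the document.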

The source code can be found on GitHub at ChrizH/pdfstructure. The library is written in Python 3 and is in pre-alpha.

I am happy to receive thoughts or input of any kind!
