Part 8 [传奇开心果 Series] Python Office Automation Library Techniques by Example: An In-Depth Look at Using Python Libraries to Clean Text Extracted from PDF Files

Article link: https://blog.csdn.net/m0_61369275/article/details/138387693


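In this example we define a `remove_whitespace` function that takes a text string as input and uses `re.sub` to replace each run of whitespace characters with a single space. The regular expression `r'\s+'` matches one or more consecutive whitespace characters, including tabs and newlines.

We then call `remove_whitespace` with the sample text as input and print the result.

Running the example prints the text with the extra spaces removed: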

This is some example text with extra spaces.


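In practice, you can apply this function to text extracted from a PDF to remove unwanted whitespace. As needed, you can add further string-processing steps to clean up and adjust the text.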
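Because `\s` also matches newlines, collapsing every whitespace run to a single space flattens line and paragraph breaks as well. If later steps still need those breaks, one possible variant is sketched below. The function name `collapse_spaces_keep_breaks` is hypothetical and is not part of the original example.

```python
import re


def collapse_spaces_keep_breaks(text):
    """Collapse runs of spaces and tabs but keep line breaks (hypothetical helper)."""
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse horizontal whitespace only: any whitespace except a newline.
    text = re.sub(r"[^\S\n]+", " ", text)
    # Drop the single space left before or after each line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Reduce three or more consecutive newlines to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "This   is \t some text.  \n\n\n\nNext    paragraph\there."
print(collapse_spaces_keep_breaks(sample))
```

Keeping the blank line between paragraphs matters if the text is later passed to a step that relies on blank lines as paragraph separators.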


### 5. Example Code: Merging Paragraphs and Lines with Python


![Illustration](https://img-blog.csdnimg.cn/96a29a6cf9854faea1eb8abe40990586.jpg)


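When text is extracted from a PDF or another source, a paragraph is sometimes split across several lines. Python's string methods are enough to merge those lines back into complete paragraphs. The following example shows how: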

```python
def merge_paragraphs(lines):
    """Merge consecutive non-empty lines into paragraphs; blank lines separate them."""
    merged_lines = []
    current_paragraph = ""

    for line in lines:
        line = line.strip()  # Strip leading and trailing whitespace

        if line:  # Non-empty line: append it to the current paragraph
            current_paragraph += line + " "
        elif current_paragraph:  # Blank line after text: the paragraph is complete
            merged_lines.append(current_paragraph.strip())
            current_paragraph = ""  # Reset for the next paragraph

    if current_paragraph:  # Handle the last paragraph
        merged_lines.append(current_paragraph.strip())

    merged_text = "\n".join(merged_lines)  # One paragraph per output line

    return merged_text


# Example text
lines = [
    "This is the first line of the first paragraph.",
    "This is the second line of the first paragraph.",
    "",
    "This is the first line of the second paragraph.",
    "This is the second line of the second paragraph.",
]

merged_text = merge_paragraphs(lines)
print(merged_text)
```


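In this example we define a `merge_paragraphs` function that takes a list of text lines as input. We iterate over the lines and merge them according to their content.

If a line is not empty, we append it to the current paragraph string, followed by a space. If a line is empty and the current paragraph is not, we add the current paragraph to the list of merged lines and reset the paragraph string.

After the loop, we handle the last paragraph, which has no blank line after it, by adding it to the list of merged lines.

Finally, we join the list of merged lines into a single text string, using `\n` as the separator.

Running the example prints the merged paragraphs: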

This is the first line of the first paragraph. This is the second line of the first paragraph.
This is the first line of the second paragraph. This is the second line of the second paragraph.


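You can apply this function to text extracted from a PDF to merge split lines back into complete paragraphs. As needed, you can add further string-processing steps to clean up and adjust the text.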
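Text extracted from a PDF usually arrives as one string rather than a list of lines, and words are sometimes hyphenated across line breaks. The sketch below shows one way to handle both. It assumes that blank lines mark paragraph boundaries and that a trailing hyphen always indicates a split word, which will wrongly join genuine hyphenated compounds that happen to break at a line end. The name `merge_extracted_text` is hypothetical.

```python
def merge_extracted_text(text):
    """Merge wrapped lines into paragraphs and rejoin words hyphenated at line ends."""
    paragraphs = []
    current = ""  # Text of the paragraph being built

    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current paragraph, if there is one.
            if current:
                paragraphs.append(current)
                current = ""
        elif current.endswith("-"):
            # Assumption: a trailing hyphen means the word continues on this line.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line

    if current:  # Flush the final paragraph
        paragraphs.append(current)

    return "\n".join(paragraphs)


raw = "Cleaning extracted text is an impor-\ntant step.\n\nSecond paragraph,\nstill going."
print(merge_extracted_text(raw))
```

Using `str.splitlines()` rather than `split("\n")` means Windows-style `\r\n` line endings are handled without extra work.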


### 6. Example Code: Handling Special Characters and Encodings with Python


![Illustration](https://img-blog.csdnimg.cn/103666acfcb444a09b8247da0df8d131.png)


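Handling special characters and encodings is an important step when extracting text from PDFs. Python provides several libraries and tools for these cases.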
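One standard-library tool that can help here is `unicodedata.normalize`. Extracted PDF text often contains typographic ligatures, non-breaking spaces, and invisible control characters. The sketch below is an illustration rather than part of the original example: `normalize_pdf_text` is a hypothetical name, and NFKC is a deliberate choice that also rewrites other compatibility characters (for example full-width forms and superscripts), which may not always be wanted.

```python
import unicodedata


def normalize_pdf_text(text):
    """Apply NFKC normalization and drop invisible control/format characters."""
    # NFKC replaces compatibility characters with their plain equivalents,
    # e.g. the ligature U+FB01 becomes "fi" and U+00A0 becomes a normal space.
    normalized = unicodedata.normalize("NFKC", text)
    # Remove characters in categories "Cc" (control) and "Cf" (format),
    # except newline and tab, which carry layout information.
    return "".join(
        ch for ch in normalized
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )


sample = "The \ufb01nal\u00a0report\u200b was \ufb02awless.\x0c"
print(normalize_pdf_text(sample))
```

Removing every "Cf" character is a blunt rule: it also strips joiners that some scripts need, so it may be worth narrowing for multilingual documents.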


1. Use Python's built-in string methods to handle Unicode characters:

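```python
# Sample text containing literal Unicode escape sequences.
# The raw string keeps "\u2022" as six plain characters instead of a bullet.
text = r"This is some text with Unicode characters: \u2022 bullet point, \u00A9 copyright symbol"

# Interpret the escape sequences (reliable only while the text is pure ASCII)
decoded_text = text.encode('utf-8').decode('unicode_escape')

print(decoded_text)
```


In this example we define sample text that contains Unicode escape sequences written out literally, as can happen in extracted text. We use `encode('utf-8').decode('unicode_escape')` to encode the text to bytes and then decode those bytes with the `unicode_escape` codec, which turns each escape sequence into the character it denotes. Finally, we print the decoded text. Note that this round trip is only reliable when the input is pure ASCII; characters that are already non-ASCII are corrupted by it.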
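When the text mixes literal escape sequences with real non-ASCII characters, the `unicode_escape` round trip corrupts the latter, because the codec treats the UTF-8 bytes as Latin-1. One alternative is to replace only the literal `\uXXXX` sequences with a regular expression. This is a minimal sketch: `decode_unicode_escapes` is a hypothetical name, and it handles only four-digit `\u` escapes, not `\U`, `\x`, or surrogate pairs.

```python
import re

# Matches a literal backslash, "u", and exactly four hexadecimal digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_unicode_escapes(text):
    """Replace literal \\uXXXX sequences with the characters they denote."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Raw string: the backslashes are literal, as they may be in extracted text.
# The euro sign is a real non-ASCII character and must survive unchanged.
sample = r"Caf\u00e9 menu \u2022 prix: 5 €"
print(decode_unicode_escapes(sample))
```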


2. Use the third-party library fitz (the import name of PyMuPDF) to handle non-standard fonts:

import fitz

# Open the PDF file
pdf_file = "example.pdf"
doc = fitz.open(pdf_file)

# Extract the text content
text = ""
for page in doc:
    text += page.ge