Quickly Extracting Text from PPT Files with Python

Eiceblue

Last modified: 2024-03-07 14:09:00

First published: 2024-03-07 14:08:36

Article link: https://blog.csdn.net/eiceblue/article/details/136532235

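Extracting the text from a PowerPoint file directly makes it easy to process or analyze the content further, or to reuse it when writing other documents. With a Python program, text can be extracted from many presentations quickly and in batches, which makes information gathering and data analysis more efficient. This article shows how to use Python to extract the text of a PowerPoint presentation, including the body text on slides, the speaker notes, and the slide comments.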

The methods in this article require Spire.Presentation for Python, which can be downloaded from the official website or installed from PyPI: pip install Spire.Presentation.

Apply for a free license

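Extracting Slide Text from a PPT with Python

On a PowerPoint slide, text lives inside shapes such as text boxes and graphics. To extract the slide text, first get the shapes on each slide and then read the text they contain. The steps are as follows:

Create a Presentation object and load the PPT with the Presentation.LoadFromFile() method.
Loop through the slides in the PPT, then loop through the shapes on each slide.
Check whether each shape is an IAutoShape instance. If it is, get its paragraphs through IAutoShape.TextFrame.Paragraphs, then get the text of each paragraph through the Paragraph.Text property.
Write the text to a text file.

Code example: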

Python

from spire.presentation import *
from spire.presentation.common import *

# Create a Presentation object
pres = Presentation()

# Load the PowerPoint presentation
pres.LoadFromFile("Sample.pptx")

text = []
# Loop through every slide
for slide in pres.Slides:
    # Loop through every shape on the slide
    for shape in slide.Shapes:
        # Check whether the shape is an IAutoShape instance
        if isinstance(shape, IAutoShape):
            # Collect the text of each paragraph in the shape
            for paragraph in shape.TextFrame.Paragraphs:
                text.append(paragraph.Text)

# Write the text to a text file, one paragraph per line
with open("output/SlideText.txt", "w", encoding="utf-8") as f:
    for s in text:
        f.write(s + "\n")
pres.Dispose()
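
The example above writes into an output/ folder, and Python's open() raises FileNotFoundError when that folder does not exist. A minimal sketch of one way to avoid this, using only the standard library; write_lines is a hypothetical helper name for illustration and is not part of Spire.Presentation:

```python
from pathlib import Path


def write_lines(path, lines):
    """Write each string in `lines` to `path` as UTF-8, one per line.

    Hypothetical helper for illustration; it creates missing parent folders.
    """
    target = Path(path)
    # Create "output/" (or any other missing parent folder) before opening the file
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
    return target
```

With this in place, the final write step could be reduced to a call such as write_lines("output/result.txt", text), where text is the list built in the loop.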

Extraction result:

[Figure: slide text extracted from the PPT with Python]

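Extracting Speaker Notes from a PPT with Python

Notes are extra information attached to a slide. They guide or prompt the presenter and are not shown to the audience. The notes of a slide are stored in a NotesSlide object, which is available through the ISlide.NotesSlide property. Once you have that object, the NotesSlide.NotesTextFrame.Text property returns the note text. The steps are as follows:

Create a Presentation object and load the PPT with the Presentation.LoadFromFile() method.
Loop through the slides in the PPT, get the NotesSlide object through the ISlide.NotesSlide property, then extract the note text through the NotesSlide.NotesTextFrame.Text property.
Write the text to a text file.

Code example: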

Python

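from spire.presentation import *
from spire.presentation.common import *

# Create a Presentation object
pres = Presentation()

# Load the PowerPoint presentation
pres.LoadFromFile("Sample.pptx")

notes_list = []
# Loop through every slide
for slide in pres.Slides:
    # Get the notes slide
    notes_slide = slide.NotesSlide
    # Get the note text
    # Assumption: NotesSlide is None when a slide has no notes, so store an empty string
    notes = notes_slide.NotesTextFrame.Text if notes_slide is not None else ""
    notes_list.append(notes)

# Write the notes to a text file, one slide per line
with open("output/Notes.txt", "w", encoding="utf-8") as f:
    for note in notes_list:
        f.write(note)
        f.write("\n")
pres.Dispose()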
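
The loop above appends one entry per slide, so an empty note ends up as a blank line and the output does not say which slide a note came from. A small sketch of one way to label the notes before writing them; label_notes is a hypothetical helper, and it assumes notes_list holds one string (possibly empty) per slide, in slide order:

```python
def label_notes(notes_list):
    """Return "Slide N: text" lines, skipping slides whose note is empty."""
    labeled = []
    # Count from 1 so the labels match the slide numbers shown in PowerPoint
    for number, note in enumerate(notes_list, start=1):
        cleaned = (note or "").strip()
        if cleaned:
            labeled.append(f"Slide {number}: {cleaned}")
    return labeled
```

The returned list can be written to the text file in place of notes_list.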

Extraction result:

[Figure: speaker notes extracted from the PPT with Python]

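Extracting Comments from a PPT with Python

Comments on slides can be retrieved through the ISlide.Comments property, and the text of each comment through the Comment.Text property. The steps are as follows:

Create a Presentation object and load the PPT with the Presentation.LoadFromFile() method.
Loop through the slides and get the collection of comments on each slide through the ISlide.Comments property.
Loop through the comments and extract the text of each one through the Comment.Text property.
Write the text to a text file.

Code example: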

Python

from spire.presentation import *
from spire.presentation.common import *

# Create a Presentation object
pres = Presentation()

# Load the PowerPoint presentation
pres.LoadFromFile("Sample.pptx")

comments_list = []
# Loop through every slide
for slide in pres.Slides:
    # Get all comments on the slide
    comments = slide.Comments
    # Loop through the comments
    for comment in comments:
        # Get the comment text
        comment_text = comment.Text
        comments_list.append(comment_text)

# Write the comments to a text file, one comment per line
with open("output/Comments.txt", "w", encoding="utf-8") as f:
    for comment in comments_list:
        f.write(comment + "\n")
pres.Dispose()
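
The introduction mentions batch extraction, while each example loads a single file. The sketch below shows one possible way to collect the presentation files in a folder with the standard library; find_presentations and its default extension list are hypothetical choices for illustration, not part of Spire.Presentation:

```python
from pathlib import Path


def find_presentations(folder, extensions=(".pptx", ".ppt")):
    """Return the presentation files directly inside `folder`, sorted by name."""
    matches = [
        p for p in Path(folder).iterdir()
        if p.is_file()
        and p.suffix.lower() in extensions
        # Skip the "~$name.pptx" lock files PowerPoint creates while a file is open
        and not p.name.startswith("~$")
    ]
    return sorted(matches)
```

Each returned path could then be passed as str(path) to Presentation.LoadFromFile() inside a loop, reusing the extraction code from any of the three sections.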