Python + spire.doc: Reading the Contents of a Word Document

Latest recommended article published on 2025-03-20 16:42:39

觅远

Categories: Office Automation, python. Tags: python, word, c#

Article link: https://blog.csdn.net/JBY2020/article/details/145593276


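Note: the file must be closed (not open in another program such as Word) while you read from or write to it; otherwise an error is raised.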
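One way to catch this early is to probe the file before handing it to Spire.Doc. The sketch below is an assumption on my part and not part of the library: the helper name `is_probably_unlocked` is hypothetical, and the check relies on the operating system refusing a read/write open while another program holds the file. That is typical on Windows but not guaranteed on Linux or macOS.

```python
import os


def is_probably_unlocked(path: str) -> bool:
    """Return True if the file can be opened for reading and writing.

    Hypothetical helper (not part of Spire.Doc). On Windows, opening a
    document that Word currently holds open usually raises PermissionError.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    try:
        # "r+b" requests read/write access without truncating the file.
        with open(path, "r+b"):
            return True
    except PermissionError:
        return False
```

Calling `is_probably_unlocked(inputFile)` before `LoadFromFile` gives a clearer failure point than an error raised from inside the library.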

Reading all text content

from spire.doc import *
from spire.doc.common import *

inputFile = r'自检测试报告.doc'
outputFile = r'自检测试报告.docx'

document = Document()  # create a Document instance
document.LoadFromFile(inputFile)  # load the Word document
document_text = document.GetText()
print(document_text)
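The string returned by `GetText()` is ordinary Python text, so it can be post-processed with the standard library. A minimal sketch, assuming paragraphs are separated by line breaks; `non_empty_lines` is a hypothetical helper name, not a Spire.Doc function.

```python
def non_empty_lines(text: str) -> list[str]:
    """Split extracted text into stripped, non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Intended use with the variable from the snippet above:
#     for line in non_empty_lines(document_text):
#         print(line)
```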

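Reading data by section

The Document.Sections[index] property returns a specific section of a Word document. Once you have the section, you can iterate over its paragraphs, tables, and other content.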

print(len(document.Sections))  # number of sections
print(document.Sections.Count)  # number of sections (same value)
sections = document.Sections

# Print the text paragraph by paragraph
for i in range(sections.Count):
    paragraphs = sections[i].Paragraphs
    for p in range(paragraphs.Count):
        print(paragraphs[p].Text)
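The index-based loops above can be wrapped in a small generator so that callers work with plain Python iteration. This is only a sketch: `iter_items` and `iter_paragraph_texts` are hypothetical names, and the code assumes nothing beyond the `.Count` property and `[index]` access already used in this article.

```python
from typing import Any, Iterator


def iter_items(collection: Any) -> Iterator[Any]:
    """Yield each item of a collection that exposes .Count and indexing."""
    for index in range(collection.Count):
        yield collection[index]


def iter_paragraph_texts(document: Any) -> Iterator[tuple[int, str]]:
    """Yield (section_index, paragraph_text) for every paragraph."""
    for section_index, section in enumerate(iter_items(document.Sections)):
        for paragraph in iter_items(section.Paragraphs):
            yield section_index, paragraph.Text


# Intended use with a loaded Spire.Doc document:
#     for section_index, text in iter_paragraph_texts(document):
#         print(section_index, text)
```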

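Reading by page

A Word document is essentially a flow document with a flow layout, so it has no built-in concept of a "page". To make page-level operations easier, Spire.Doc for Python provides the FixedLayoutDocument class, which converts a Word document to a fixed layout.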

layoutDoc = FixedLayoutDocument(document)  # convert the Word document to a fixed layout

print(layoutDoc.Pages.Count)  # number of pages

for p in range(layoutDoc.Pages.Count):
    page_data = layoutDoc.Pages[p]
    # print(page_data.Text)   # read the text page by page
    cols_data = page_data.Columns
    for col in range(len(cols_data)):
        # print(cols_data[col].Text)  # read the text column by column
        row_data = cols_data[col].Lines
        for row in range(len(row_data)):
            print(row_data[row].Text)  # read the text line by line
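If you want the result as data rather than printed output, the same traversal can fill a dictionary keyed by page number. A sketch under stated assumptions: `lines_by_page` and `_size` are hypothetical helpers, and `_size` accepts either a `.Count` property or `len()` because the loop above uses both styles.

```python
from typing import Any


def _size(collection: Any) -> int:
    """Return the item count via .Count when present, else via len()."""
    count = getattr(collection, "Count", None)
    return count if isinstance(count, int) else len(collection)


def lines_by_page(layout_doc: Any) -> dict[int, list[str]]:
    """Map 1-based page numbers to the text lines found on that page."""
    result: dict[int, list[str]] = {}
    pages = layout_doc.Pages
    for page_index in range(_size(pages)):
        lines: list[str] = []
        columns = pages[page_index].Columns
        for column_index in range(_size(columns)):
            column_lines = columns[column_index].Lines
            for line_index in range(_size(column_lines)):
                lines.append(column_lines[line_index].Text)
        result[page_index + 1] = lines
    return result


# Intended use:
#     pages = lines_by_page(FixedLayoutDocument(document))
#     print(pages.get(1, []))
```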

Reading headers and footers

sections = document.Sections

for i in range(sections.Count):
    header = sections[i].HeadersFooters.Header  # header object of this section
    footer = sections[i].HeadersFooters.Footer  # footer object of this section

    for h in range(header.Paragraphs.Count):
        headerPara = header.Paragraphs[h]
        print(headerPara.Text)

    for f in range(footer.Paragraphs.Count):
        footerPara = footer.Paragraphs[f]
        print(footerPara.Text)
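The header and footer text can also be collected per section instead of printed. This is a sketch, not library code: `header_footer_texts` is a hypothetical name, and it only relies on the `HeadersFooters.Header`, `HeadersFooters.Footer`, `Paragraphs`, `Count`, and `Text` members used above.

```python
from typing import Any


def header_footer_texts(document: Any) -> list[dict[str, list[str]]]:
    """Return one {"header": [...], "footer": [...]} dict per section."""
    results = []
    sections = document.Sections
    for i in range(sections.Count):
        headers_footers = sections[i].HeadersFooters
        entry = {}
        for key, part in (("header", headers_footers.Header),
                          ("footer", headers_footers.Footer)):
            paragraphs = part.Paragraphs
            entry[key] = [paragraphs[p].Text for p in range(paragraphs.Count)]
        results.append(entry)
    return results
```

Returning a list indexed by section keeps the headers and footers of different sections apart, which matters when a document changes its header partway through.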

Iterating over table data

document = Document()  # create a Document instance
document.LoadFromFile(inputFile)  # load the Word document

for i in range(document.Sections.Count):
    section = document.Sections.get_Item(i)
    for j in range(section.Tables.Count):
        table = section.Tables.get_Item(j)

        # Iterate over the rows of the table
        for row in range(table.Rows.Count):
            row_obj = table.Rows.get_Item(row)
            row_data = []

            # Iterate over the cells of the row
            for cell in range(row_obj.Cells.Count):
                cell_obj = row_obj.Cells.get_Item(cell)
                cell_text = ""

                # Concatenate the text of every paragraph in the cell
                for paragraph_index in range(cell_obj.Paragraphs.Count):
                    paragraph = cell_obj.Paragraphs.get_Item(paragraph_index)
                    cell_text += paragraph.Text

                row_data.append(cell_text.strip())

            # Print the row data
            print(row_data)

document.Close()
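Each `row_data` list printed above is already in a shape that the standard `csv` module understands. The sketch below shows one possible way to serialize collected rows; `rows_to_csv` is a hypothetical helper, and it assumes you append every `row_data` list to a `rows` list instead of printing it.

```python
import csv
import io


def rows_to_csv(rows: list[list[str]]) -> str:
    """Serialize table rows (lists of cell strings) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Intended use, with rows gathered from the table loop:
#     with open("table.csv", "w", encoding="utf-8", newline="") as f:
#         f.write(rows_to_csv(rows))
```

The `csv` writer quotes cells that contain commas or quotes, which plain string joining would not handle.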

Finding specific text

def FindAllString(self, matchString: str, caseSensitive: bool, wholeWord: bool) -> List['TextSelection']:

Parameters:

matchString: str, the text to search for.
caseSensitive: bool, if True, the match is case-sensitive.
wholeWord: bool, if True, only whole-word matches are returned.
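To make the two flags concrete, here is a plain-Python analogue built on the `re` module. It is not Spire.Doc's matching algorithm, only an illustration of what "case-sensitive" and "whole word" usually mean; `find_all` is a hypothetical function, and the library may treat word boundaries differently, especially for text without spaces such as Chinese.

```python
import re


def find_all(text: str, match_string: str,
             case_sensitive: bool, whole_word: bool) -> list[tuple[int, int]]:
    """Return (start, end) spans of matches, mimicking the two flags."""
    pattern = re.escape(match_string)
    if whole_word:
        # \b matches only at a boundary between word and non-word characters.
        pattern = r"\b" + pattern + r"\b"
    flags = 0 if case_sensitive else re.IGNORECASE
    return [m.span() for m in re.finditer(pattern, text, flags)]


sample = "Test report, retest, TEST"
print(find_all(sample, "test", case_sensitive=False, whole_word=True))
print(find_all(sample, "test", case_sensitive=True, whole_word=False))
```

With the whole-word flag set, the "test" inside "retest" is skipped; with case sensitivity on and the whole-word flag off, it is the only match.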

You can then perform further operations on the matches, for example highlighting them:

document = Document()  # create a Document instance
document.LoadFromFile(inputFile)  # load the Word document

textSelections = document.FindAllString("测试报告", False, True)

# Set a highlight color on every match
for selection in textSelections:
    selection.GetAsOneRange().CharacterFormat.HighlightColor = Color.get_Blue()

document.SaveToFile(outputFile, FileFormat.Docx)
document.Close()
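Because the file must be released after use, it can help to guarantee that `Close()` runs even when an exception occurs in between. A minimal sketch using `contextlib`; `open_document` is a hypothetical wrapper, and it assumes only the `LoadFromFile` and `Close` methods used throughout this article.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def open_document(factory: Callable[[], Any], path: str) -> Iterator[Any]:
    """Load a document and guarantee Close() runs when the block exits."""
    document = factory()
    try:
        document.LoadFromFile(path)
        yield document
    finally:
        document.Close()


# Intended use with Spire.Doc:
#     with open_document(Document, inputFile) as document:
#         print(document.GetText())
```

Passing the class as `factory` keeps the helper independent of the import, so it can be tried out with a stand-in object before using it on real files.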