Automating Word .docx files at the XML level in Python: fixing images that do not display and the "unreadable content" error

wzp10

Last modified 2024-05-28 18:33:45


First published 2024-05-28 18:28:45

Article link: https://blog.csdn.net/wzp10/article/details/139274676


Environment

from xml.dom import minidom

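Goal

Copy content from one Word document into another while preserving its original formatting, working on the document XML directly with xml.dom instead of the python-docx library.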

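Word format

File format

A Word .docx file is essentially a zip archive. The document content itself is stored as XML, and most of the other files in the archive are XML as well.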
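Since a .docx is a zip archive, its parts can be inspected with the standard zipfile module without unpacking anything to disk. The following is a minimal sketch; the helper names list_docx_parts and read_docx_part are made up for illustration and are not part of any library.

```python
import io
import zipfile


def list_docx_parts(docx_file):
    """Return the names of all parts stored inside a .docx package."""
    with zipfile.ZipFile(docx_file) as package:
        return package.namelist()


def read_docx_part(docx_file, part_name="word/document.xml"):
    """Return the raw bytes of one part, by default the main document XML."""
    with zipfile.ZipFile(docx_file) as package:
        return package.read(part_name)
```

Both helpers accept either a path such as "1.docx" or a file-like object such as io.BytesIO, because zipfile.ZipFile accepts both.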

File structure

├─1
│  │  [Content_Types].xml  declares the file extensions (content types) Word needs to handle
│  │
│  ├─docProps
│  │      app.xml
│  │      core.xml         contains the document author
│  │
│  ├─word
│  │  │  document.xml      the document's text, tables, images, etc.; the main file to edit
│  │  │  endnotes.xml      
│  │  │  fontTable.xml
│  │  │  footnotes.xml
│  │  │  settings.xml
│  │  │  styles.xml
│  │  │  webSettings.xml
│  │  │
│  │  ├─media              media files used by the document
│  │  │      image1.gif
│  │  │
│  │  ├─theme              theme folder
│  │  │      theme1.xml
│  │  │
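│  │  └─_rels              relationships referenced by document.xml; links the files together via rId
│  │          document.xml.rels  holds the image references; if an edited image does not display, this file was probably not updated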
│  │
│  └─_rels
│          .rels
├─2
│  │  [Content_Types].xml
│  │
│  ├─docProps
│  │      app.xml
│  │      core.xml
│  │
│  ├─word
│  │  │  document.xml
│  │  │  endnotes.xml
│  │  │  fontTable.xml
│  │  │  footnotes.xml
│  │  │  settings.xml
│  │  │  styles.xml
│  │  │  webSettings.xml
│  │  │
│  │  ├─theme
│  │  │      theme1.xml
│  │  │
│  │  └─_rels
│  │          document.xml.rels
│  │
│  └─_rels
│          .rels

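<w:p> represents a paragraph
<w:r> represents a run, which specifies the display formatting of the text it contains
<w:t> holds the actual text content
<w:tbl> represents a table
- <w:tr> represents a table row
- <w:tc> represents a cell, i.e. a single table cell

The text itself sits in a text node under <w:t>, so use firstChild to reach it.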
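To illustrate the firstChild access, here is a small sketch that collects the text of each paragraph with minidom. SAMPLE is a hand-written stand-in for a real document.xml, and paragraph_texts is a hypothetical helper name.

```python
from xml.dom import minidom

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written stand-in for word/document.xml (assumption, not a real file).
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body>'
    "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t/></w:r></w:p>"
    "</w:body></w:document>" % W_NS
)


def paragraph_texts(document_xml):
    """Return the text of each <w:p>, joining the <w:t> nodes inside it."""
    dom = minidom.parseString(document_xml)
    texts = []
    for paragraph in dom.getElementsByTagName("w:p"):
        parts = []
        for t_node in paragraph.getElementsByTagName("w:t"):
            # An empty <w:t/> has no child node, so firstChild is None.
            if t_node.firstChild is not None:
                parts.append(t_node.firstChild.data)
        texts.append("".join(parts))
    return texts


print(paragraph_texts(SAMPLE))  # ['Hello world', '']
```

Note that getElementsByTagName in minidom matches the qualified tag name, so the "w:" prefix must be written exactly as it appears in the file.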

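Reproducing the problem

I wanted to copy text, images, and formulas from one Word document into another while keeping their formatting. I started by copying the relevant nodes from the source document.xml into the target document. Everything came out fine except that the image did not display. Checking the source package, I found that the contents of the media folder had not been copied over. I then copied the image into the corresponding directory of the target document, but it still did not display, and now opening the document made Word report unreadable content.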
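The node-copying step described above might look roughly like the sketch below. It is an illustration under assumptions, not the exact code used for this article: copy_paragraphs and make_doc are hypothetical helpers, and only top-level <w:p> and <w:tbl> elements are copied. Nodes from another DOM have to be brought in with importNode, and a <w:sectPr>, if present, has to remain the last child of <w:body>.

```python
from xml.dom import minidom

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_doc(body):
    """Build a tiny stand-in for document.xml (for demonstration only)."""
    return '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)


def copy_paragraphs(src_xml, dst_xml):
    """Append the source's top-level paragraphs and tables to the target body."""
    src = minidom.parseString(src_xml)
    dst = minidom.parseString(dst_xml)
    src_body = src.getElementsByTagName("w:body")[0]
    dst_body = dst.getElementsByTagName("w:body")[0]

    # Insert before the target's <w:sectPr> so it stays the last child.
    anchor = None
    for node in dst_body.childNodes:
        if node.nodeType == node.ELEMENT_NODE and node.tagName == "w:sectPr":
            anchor = node
            break

    for node in src_body.childNodes:
        if node.nodeType == node.ELEMENT_NODE and node.tagName in ("w:p", "w:tbl"):
            clone = dst.importNode(node, True)  # deep copy owned by dst
            dst_body.insertBefore(clone, anchor)  # anchor=None appends
    return dst.toxml()
```

One thing this sketch does not handle: copied nodes keep their prefixes (wp:, a:, r: and so on), so the target's root element needs to declare the same namespace prefixes as the source for the result to stay well-formed.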

That leaves two problems to solve:

How to display the image correctly?
How to fix the Word error?

Displaying the image correctly

Here is the image-related markup in document.xml

<w:drawing>
    <wp:inline distT="0" distB="0" distL="0" distR="0"
        wp14:anchorId="77FAF658" wp14:editId="331AEBE1">
        <wp:extent cx="2842786" cy="2132374" />
        <wp:effectExtent l="0" t="0" r="0" b="1270" />
        <wp:docPr id="1" name="图片 1" />
        <wp:cNvGraphicFramePr>
            <a:graphicFrameLocks
                xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
                noChangeAspect="1" />
        </wp:cNvGraphicFramePr>
        <a:graphic
            xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
            <a:graphicData
                uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
                <pic:pic
                    xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
                    <pic:nvPicPr>
                        <pic:cNvPr id="0" name="Picture 1" />
                        <pic:cNvPicPr>
                            <a:picLocks noChangeAspect="1"
                                noChangeArrowheads="1" />
                        </pic:cNvPicPr>
                    </pic:nvPicPr>
                    <pic:blipFill>
                        <a:blip r:embed="rId6">
                            <a:extLst>
                                <a:ext
                                    uri="{28A0092B-C50C-407E-A947-70E740481C1C}">
                                    <a14:useLocalDpi
                                        xmlns:a14="http://schemas.microsoft.com/office/drawing/2010/main"
                                        val="0" />
                                </a:ext>
                            </a:extLst>
                        </a:blip>
                        <a:srcRect />
                        <a:stretch>
                            <a:fillRect />
                        </a:stretch>
                    </pic:blipFill>
                    <pic:spPr bwMode="auto">
                        <a:xfrm>
                            <a:off x="0" y="0" />
                            <a:ext cx="2847489" cy="2135902" />
                        </a:xfrm>
                        <a:prstGeom prst="rect">
                            <a:avLst />
                        </a:prstGeom>
                        <a:noFill />
                        <a:ln>
                            <a:noFill />
                        </a:ln>
                    </pic:spPr>
                </pic:pic>
            </a:graphicData>
        </a:graphic>
    </wp:inline>
</w:drawing>

The tricky part is this line: the image is inserted by reference through this id

<a:blip r:embed="rId6">

<Relationship Id="rId6"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
    Target="media/image1.gif" />

That Relationship is the matching entry in document.xml.rels. Here is the whole file from document 1:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Id="rId8"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme"
        Target="theme/theme1.xml" />
    <Relationship Id="rId3"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings"
        Target="webSettings.xml" />
    <Relationship Id="rId7"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable"
        Target="fontTable.xml" />
    <Relationship Id="rId2"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings"
        Target="settings.xml" />
    <Relationship Id="rId1"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
        Target="styles.xml" />
    <Relationship Id="rId6"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
        Target="media/image1.gif" />
    <Relationship Id="rId5"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/endnotes"
        Target="endnotes.xml" />
    <Relationship Id="rId4"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footnotes"
        Target="footnotes.xml" />
</Relationships>

And this is the one from document 2:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Id="rId3"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings"
        Target="webSettings.xml" />
    <Relationship Id="rId7"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme"
        Target="theme/theme1.xml" />
    <Relationship Id="rId2"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings"
        Target="settings.xml" />
    <Relationship Id="rId1"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
        Target="styles.xml" />
    <Relationship Id="rId6"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable"
        Target="fontTable.xml" />
    <Relationship Id="rId5"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/endnotes"
        Target="endnotes.xml" />
    <Relationship Id="rId4"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footnotes"
        Target="footnotes.xml" />
</Relationships>

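The rId turned out to be less simple than I assumed. An rId only has to be unique within a single .rels file, and Word usually numbers them sequentially across the whole document, so two documents will often use the same rId for different targets. Here rId6 is the image in document 1 but fontTable.xml in document 2. So when copying an image from one document to another, rId conflicts have to be taken into account.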

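Solution approach

First, get the rId used in the copied content (the r:embed value)
Then look it up in the source document.xml.rels to get its details, such as Type and Target
Then read the target document.xml.rels to find the highest rId currently in use, to avoid a conflict
Then add the relationship to the target document.xml.rels under a new rId
Finally, update the rId in the copied nodes and add them to the target document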
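The five steps above can be sketched with minidom as follows. This is one possible implementation under assumptions: next_free_rid and remap_image_rids are hypothetical names, the sample XML strings are trimmed-down stand-ins for the real files, and only numeric ids of the form rIdN are considered when looking for the highest id.

```python
import re
from xml.dom import minidom

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def next_free_rid(rels_dom):
    """Return an rId one higher than the largest numeric rId in use."""
    highest = 0
    for rel in rels_dom.getElementsByTagName("Relationship"):
        match = re.fullmatch(r"rId(\d+)", rel.getAttribute("Id"))
        if match:
            highest = max(highest, int(match.group(1)))
    return "rId%d" % (highest + 1)


def remap_image_rids(fragment_dom, src_rels_dom, dst_rels_dom):
    """Re-point each <a:blip r:embed> in fragment_dom at a fresh target rId.

    Returns {new_rid: target} so the caller knows which media files to copy.
    """
    src_rels = {
        rel.getAttribute("Id"): rel
        for rel in src_rels_dom.getElementsByTagName("Relationship")
    }
    old_to_new, copied = {}, {}
    for blip in fragment_dom.getElementsByTagName("a:blip"):
        old_rid = blip.getAttribute("r:embed")                # step 1
        if old_rid not in src_rels:
            continue
        if old_rid not in old_to_new:
            src_rel = src_rels[old_rid]                       # step 2
            new_rid = next_free_rid(dst_rels_dom)             # step 3
            new_rel = dst_rels_dom.createElement("Relationship")
            new_rel.setAttribute("Id", new_rid)               # step 4
            new_rel.setAttribute("Type", src_rel.getAttribute("Type"))
            new_rel.setAttribute("Target", src_rel.getAttribute("Target"))
            dst_rels_dom.documentElement.appendChild(new_rel)
            old_to_new[old_rid] = new_rid
            copied[new_rid] = src_rel.getAttribute("Target")
        blip.setAttribute("r:embed", old_to_new[old_rid])     # step 5
    return copied


# Trimmed-down stand-ins for the real files (assumptions for the demo).
SRC_RELS = (
    '<Relationships xmlns="%s"><Relationship Id="rId6" Type="%s/image" '
    'Target="media/image1.gif"/></Relationships>' % (REL_NS, R_NS)
)
DST_RELS = (
    '<Relationships xmlns="%s">'
    '<Relationship Id="rId6" Type="%s/fontTable" Target="fontTable.xml"/>'
    '<Relationship Id="rId7" Type="%s/theme" Target="theme/theme1.xml"/>'
    "</Relationships>" % (REL_NS, R_NS, R_NS)
)
FRAGMENT = (
    '<pic:blipFill xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture" '
    'xmlns:a="%s" xmlns:r="%s"><a:blip r:embed="rId6"/></pic:blipFill>' % (A_NS, R_NS)
)

fragment = minidom.parseString(FRAGMENT)
dst_rels = minidom.parseString(DST_RELS)
copied = remap_image_rids(fragment, minidom.parseString(SRC_RELS), dst_rels)
print(copied)  # {'rId8': 'media/image1.gif'}
```

The returned mapping tells you which files under word/media still need to be copied into the target package. The sketch does not deal with file name clashes: if the target already contains a different media/image1.gif, the copied file would need a new name and the Target attribute would have to be adjusted to match.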

The "unreadable content" error


Suspect

[Content_Types].xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
    <Default Extension="gif" ContentType="image/gif" />
    <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml" />
    <Default Extension="xml" ContentType="application/xml" />
    <Override PartName="/word/document.xml"
        ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" />
    <Override PartName="/word/styles.xml"
        ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml" />
    <Override PartName="/word/settings.xml"
        ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml" />
    <Override PartName="/word/webSettings.xml"
        ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml" />
    <Override PartName="/word/footnotes.xml"
        ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml" />
    <Override PartName="/word/endnotes.xml"
        ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml" />
    <Override PartName="/word/fontTable.xml"
        ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml" />
    <Override PartName="/word/theme/theme1.xml"
        ContentType="application/vnd.openxmlformats-officedocument.theme+xml" />
    <Override PartName="/docProps/core.xml"
        ContentType="application/vnd.openxmlformats-package.core-properties+xml" />
    <Override PartName="/docProps/app.xml"
        ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml" />
</Types>

    <Default Extension="gif" ContentType="image/gif" />

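Does this entry have to be present for Word to read the gif?

After I added this line by hand, the document opened without the error. My guess is that without a content type declared for the gif extension, Word cannot read the gif part and therefore reports the error. So the fix is to declare every file extension the document needs in [Content_Types].xml.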
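The manual fix can also be scripted. Below is a minimal sketch, assuming the content of [Content_Types].xml has already been read into a string; ensure_default_content_type is a hypothetical helper name. It adds a Default entry only when the extension is not declared yet.

```python
from xml.dom import minidom


def ensure_default_content_type(content_types_xml, extension, content_type):
    """Return a DOM of [Content_Types].xml that declares `extension`.

    If a <Default> for the extension already exists, nothing is changed.
    """
    dom = minidom.parseString(content_types_xml)
    root = dom.documentElement
    for default in root.getElementsByTagName("Default"):
        if default.getAttribute("Extension").lower() == extension.lower():
            return dom  # already declared
    new_default = dom.createElement("Default")
    new_default.setAttribute("Extension", extension)
    new_default.setAttribute("ContentType", content_type)
    # Placed first only to mirror the layout Word itself writes.
    root.insertBefore(new_default, root.firstChild)
    return dom
```

For the case in this article the call would be ensure_default_content_type(xml_text, "gif", "image/gif"); other image formats such as png or jpeg would need their own entries in the same way.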
