Python error when reading a Word document: KeyError: "There is no item named 'word/NULL' in the archive"

Reading a Word document with docx (the python-docx package) fails:

import docx
document = docx.Document("word.docx")

Error log:

Traceback (most recent call last):
  File "D:\cv\test\test\test85.py", line 19, in <module>
    document = docx.Document(src)
  File "C:\Users\hxm\AppData\Roaming\Python\Python310\site-packages\docx\api.py", line 25, in Document
    document_part = Package.open(docx).main_document_part
  File "C:\Users\hxm\AppData\Roaming\Python\Python310\site-packages\docx\opc\package.py", line 128, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "C:\Users\hxm\AppData\Roaming\Python\Python310\site-packages\docx\opc\pkgreader.py", line 35, in from_file
    sparts = PackageReader._load_serialized_parts(
  File "C:\Users\hxm\AppData\Roaming\Python\Python310\site-packages\docx\opc\pkgreader.py", line 69, in _load_serialized_parts
    for partname, blob, reltype, srels in part_walker:
  File "C:\Users\hxm\AppData\Roaming\Python\Python310\site-packages\docx\opc\pkgreader.py", line 110, in _walk_phys_parts
    for partname, blob, reltype, srels in next_walker:
  File "C:\Users\hxm\AppData\Roaming\Python\Python310\site-packages\docx\opc\pkgreader.py", line 105, in _walk_phys_parts
    blob = phys_reader.blob_for(partname)
  File "C:\Users\hxm\AppData\Roaming\Python\Python310\site-packages\docx\opc\phys_pkg.py", line 108, in blob_for
    return self._zipf.read(pack_uri.membername)
  File "D:\soft\miniconda3\lib\zipfile.py", line 1475, in read
    with self.open(name, "r", pwd) as fp:
  File "D:\soft\miniconda3\lib\zipfile.py", line 1514, in open
    zinfo = self.getinfo(name)
  File "D:\soft\miniconda3\lib\zipfile.py", line 1441, in getinfo
    raise KeyError(
KeyError: "There is no item named 'word/NULL' in the archive"

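A Word document in .docx format is actually a ZIP file and can be unpacked with any zip tool. So change the file extension to .zip and extract it; most of the extracted files are XML.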
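Renaming the file is not strictly necessary for a quick look, since Python's standard zipfile module opens a .docx directly. A minimal sketch, where the function name list_docx_parts is made up for illustration and "word.docx" is the file name from the snippet above:

```python
import zipfile


def list_docx_parts(path):
    """Return the names of all members stored in a .docx (ZIP) archive."""
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()


# Example usage (assumes word.docx exists in the current directory):
# for name in list_docx_parts("word.docx"):
#     print(name)
```

If the listing contains word/_rels/document.xml.rels but no word/NULL, that matches the KeyError in the traceback.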

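word/document.xml is the main file, but the error I hit has nothing to do with it. The file to check is word/_rels/document.xml.rels, which records the document's relationships to other parts and resources, such as embedded images. Relationship targets are resolved relative to the word/ folder, so a target of "NULL" makes python-docx look for a part named word/NULL, which does not exist in the archive. Inside the file you may see a line with Target="NULL" similar to this one: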

<n1:Relationship Id="rId19" Type="http://schemas.microsoft.com/office/2007/relationships/hdphoto" Target="NULL"/>
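To check a file for such entries without unzipping it by hand, something along these lines should work. It is only a sketch: find_null_targets is a hypothetical helper name, and it assumes the relationships file sits at the usual word/_rels/document.xml.rels path and uses the standard package relationships namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

RELS_PATH = "word/_rels/document.xml.rels"
# Namespace of the <Relationships>/<Relationship> elements in .rels files.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def find_null_targets(path):
    """Return (Id, Type) pairs for relationships whose Target is "NULL"."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read(RELS_PATH))
    return [
        (rel.get("Id"), rel.get("Type"))
        for rel in root.iter(f"{{{REL_NS}}}Relationship")
        if rel.get("Target") == "NULL"
    ]


# Example usage (assumes word.docx exists):
# print(find_null_targets("word.docx"))
```

The lookup is namespace-aware, so it matches the element whether the file writes it as n1:Relationship, as in the line above, or with a default namespace.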

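Delete any such line by hand and save the file (then re-zip the contents and restore the .docx extension), or remove it with code. After that the problem is resolved.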
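The code route could look roughly like the following. This is a sketch under assumptions, not the exact script used here: strip_null_relationships and the output file name are made up for illustration, and it only drops the Relationship entries. Any element in word/document.xml that still refers to the removed Id is left untouched.

```python
import zipfile
import xml.etree.ElementTree as ET

RELS_PATH = "word/_rels/document.xml.rels"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Serialize the relationships namespace as the default namespace instead of
# an auto-generated "ns0:" prefix. Note that this registration is global.
ET.register_namespace("", REL_NS)


def strip_null_relationships(src, dst):
    """Copy src to dst, dropping relationships whose Target is "NULL".

    src and dst must be different paths. Returns the number of removed entries.
    """
    removed = 0
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == RELS_PATH:
                root = ET.fromstring(data)
                # Iterate over a copy, since children are removed in the loop.
                for rel in list(root):
                    if rel.get("Target") == "NULL":
                        root.remove(rel)
                        removed += 1
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            # All other members are copied through unchanged.
            zout.writestr(item, data)
    return removed


# Example usage (assumes word.docx exists; word_fixed.docx is a made-up name):
# strip_null_relationships("word.docx", "word_fixed.docx")
# document = docx.Document("word_fixed.docx")
```

Writing to a new file keeps the original document intact in case the cleaned copy does not open as expected in Word.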

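For more on the internal structure of an unzipped .docx Word document, see the article "docx格式文档详解:xml解析html还原" (a detailed look at the docx format: parsing the XML and reconstructing HTML).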
