Error when reading a Word document with docx
import docx
document = docx.Document("word.docx")
Error log:
Traceback (most recent call last):
File "D:\cv\test\test\test85.py", line 19, in <module>
document = docx.Document(src)
File "C:\Users\hxm\AppData\Roaming\Python\Python310\site-packages\docx\api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "C:\Users\hxm\AppData\Roaming\Python\Python310\site-packages\docx\opc\package.py", line 128, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "C:\Users\hxm\AppData\Roaming\Python\Python310\site-packages\docx\opc\pkgreader.py", line 35, in from_file
sparts = PackageReader._load_serialized_parts(
File "C:\Users\hxm\AppData\Roaming\Python\Python310\site-packages\docx\opc\pkgreader.py", line 69, in _load_serialized_parts
for partname, blob, reltype, srels in part_walker:
File "C:\Users\hxm\AppData\Roaming\Python\Python310\site-packages\docx\opc\pkgreader.py", line 110, in _walk_phys_parts
for partname, blob, reltype, srels in next_walker:
File "C:\Users\hxm\AppData\Roaming\Python\Python310\site-packages\docx\opc\pkgreader.py", line 105, in _walk_phys_parts
blob = phys_reader.blob_for(partname)
File "C:\Users\hxm\AppData\Roaming\Python\Python310\site-packages\docx\opc\phys_pkg.py", line 108, in blob_for
return self._zipf.read(pack_uri.membername)
File "D:\soft\miniconda3\lib\zipfile.py", line 1475, in read
with self.open(name, "r", pwd) as fp:
File "D:\soft\miniconda3\lib\zipfile.py", line 1514, in open
zinfo = self.getinfo(name)
File "D:\soft\miniconda3\lib\zipfile.py", line 1441, in getinfo
raise KeyError(
KeyError: "There is no item named 'word/NULL' in the archive"
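A Word document in .docx format is actually a ZIP archive and can be extracted with any zip tool. So change the file extension to .zip and extract it; most of the extracted files are XML files.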
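Renaming the file is not strictly required just to look inside: Python's standard zipfile module can open a .docx directly. A minimal sketch (the helper name list_docx_members is made up for illustration):

```python
import zipfile


def list_docx_members(path):
    """Return the names of all files stored inside a .docx archive."""
    # A .docx is a ZIP archive, so zipfile opens it without renaming.
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()
```

Calling list_docx_members("word.docx") should show entries such as word/document.xml and word/_rels/document.xml.rels.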
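Among them, word/document.xml is the main file, but the error in my case has nothing to do with it. The file to check is word/_rels/document.xml.rels, which mainly tracks the images referenced by the document. Looking through it, you may find a line containing Target="NULL", similar to the following: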
<n1:Relationship Id="rId19" Type="http://schemas.microsoft.com/office/2007/relationships/hdphoto" Target="NULL"/>
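To check a document for such entries without unzipping it by hand, something like the following may work. It is only a sketch: find_null_relationships is a hypothetical helper, and it assumes the .rels file uses the standard package relationships namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

RELS_NAME = "word/_rels/document.xml.rels"
# Standard namespace of relationship parts in an Open Packaging archive.
REL_TAG = "{http://schemas.openxmlformats.org/package/2006/relationships}Relationship"


def find_null_relationships(path):
    """Return (Id, Type) pairs for relationships whose Target is "NULL"."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read(RELS_NAME))
    return [
        (rel.get("Id"), rel.get("Type"))
        for rel in root.iter(REL_TAG)
        if rel.get("Target") == "NULL"
    ]
```

An empty list means this particular problem is not present in the file.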
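docx resolves this target relative to the word/ folder and tries to read word/NULL, which does not exist in the archive, hence the KeyError. You can delete the line by hand and save the file (then re-zip the contents and change the extension back to .docx), or write code to delete it. After that the problem is solved.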
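The code-based fix could look roughly like this. It is a sketch, not a tested tool: strip_null_relationships is a hypothetical helper, and the regular expression assumes the offending element is self-closing and uses double quotes, as in the line shown above. It writes a new file rather than modifying the original.

```python
import re
import zipfile

RELS_NAME = "word/_rels/document.xml.rels"
# Matches one self-closing Relationship element (with or without a
# namespace prefix) whose Target attribute is exactly "NULL".
NULL_REL = re.compile(rb'<(?:\w+:)?Relationship\b[^>]*\bTarget="NULL"[^>]*/>')


def strip_null_relationships(src_path, dst_path):
    """Copy src_path to dst_path, dropping Target="NULL" relationships.

    Returns the number of relationships removed.
    """
    removed = 0
    with zipfile.ZipFile(src_path) as zin, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == RELS_NAME:
                # Only the relationships file is changed; everything
                # else is copied through byte for byte.
                data, removed = NULL_REL.subn(b"", data)
            zout.writestr(item, data)
    return removed
```

For example, strip_null_relationships("word.docx", "word_fixed.docx") followed by docx.Document("word_fixed.docx") should then load, assuming the NULL target was the only problem.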
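For more on the internal structure of an unzipped .docx Word document, see the article "docx格式文档详解:xml解析html还原" (a detailed look at the docx format: XML parsing and HTML reconstruction).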