Parsing Word documents (doc, docx) in Python: python-docx, python-docx2txt, zipfile

Latest recommended article published 2024-08-15 10:05:47

Shawn.Hu

Article link: https://blog.csdn.net/hshl1214/article/details/50413968

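For handling Word documents in Python, two projects are worth a look: python-docx and python-docx2txt. python-docx is the more involved of the two and is suited to creating documents, while python-docx2txt makes it easy to convert a document to plain text. Both work with the .docx format rather than the legacy binary .doc format: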

https://python-docx.readthedocs.org/en/latest/

https://github.com/python-openxml/python-docx

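In addition, a .docx file (unlike a legacy binary .doc file) is itself a ZIP archive, and the actual document content is stored as XML, so it can be unpacked with unzip: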

# unzip test.docx
Archive: test.docx
inflating: _rels/.rels
inflating: word/settings.xml
inflating: word/_rels/document.xml.rels
inflating: word/fontTable.xml
inflating: word/styles.xml
inflating: word/document.xml
inflating: docProps/app.xml
inflating: docProps/core.xml
inflating: [Content_Types].xml
# ls
[Content_Types].xml docProps _rels test.docx word

# cd word
# ls
document.xml fontTable.xml _rels settings.xml styles.xml

# cat document.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"><w:body><w:p><w:pPr><w:pStyle w:val="Heading2"/><w:spacing w:lineRule="auto" w:line="240" w:before="0" w:after="0"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr></w:r></w:p><w:p><w:pPr><w:pStyle w:val="Heading5"/><w:spacing w:lineRule="auto" w:line="240"/><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/><w:b w:val="false"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr></w:pPr><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/><w:b w:val="false"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>Summary:02</w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/><w:b w:val="false"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>系统基本功能</w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/><w:b w:val="false"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>-01</w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/><w:b w:val="false"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>系统核心功能</w:t></w:r><w:r>
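
The same inspection can also be done from Python. Below is a minimal sketch that uses only the standard-library zipfile module. The helper name list_docx_parts is hypothetical and does not come from the original post. It uses zipfile.is_zipfile to reject files that are not ZIP archives, which is the case for a legacy binary .doc file.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of the parts stored inside a .docx archive.

    Hypothetical helper for illustration; raises ValueError if the file
    is not a ZIP archive (for example, a legacy binary .doc file).
    """
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path!r} is not a ZIP archive, so it is not a .docx file")
    with zipfile.ZipFile(path) as archive:
        # namelist() returns the member paths in the order they are stored
        return archive.namelist()


# Usage, assuming test.docx exists in the current directory:
#     for name in list_docx_parts("test.docx"):
#         print(name)
```

For the test.docx shown above, the returned names should match the paths printed by unzip, such as word/document.xml and docProps/core.xml.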

Without an off-the-shelf library, you can read the archive directly with zipfile:

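import zipfile
from xml.dom import minidom

# A .docx file is a ZIP archive; the main body is stored in word/document.xml
with zipfile.ZipFile('test.docx') as document:
    xml_content = document.read('word/document.xml')

reparsed = minidom.parseString(xml_content)
# toprettyxml() returns bytes when an encoding is given, so decode before printing
print(reparsed.toprettyxml(indent=" ", encoding="utf-8").decode("utf-8"))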
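
Pretty-printing shows the structure, but often only the text is wanted. The following is a rough sketch of pulling plain text out of word/document.xml using only the standard library. The helper name docx_to_text is hypothetical. It assumes the text lives in w:t elements grouped under w:p paragraphs, as in the document.xml shown earlier, and it makes no attempt to reproduce what python-docx2txt does.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, as declared on <w:document> in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return the document text, one line per <w:p> paragraph.

    Hypothetical helper: only <w:t> text runs are collected, so tabs,
    line breaks, headers, footers and footnotes are ignored, and
    paragraphs nested inside other paragraphs (text boxes) may repeat.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across several runs; join them back
        runs = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        lines.append("".join(runs))
    return "\n".join(lines)


# Usage, assuming test.docx exists in the current directory:
#     print(docx_to_text("test.docx"))
```

Joining the runs matters because Word splits a single visible line into several w:r elements, as the sample above does with "Summary:02" and the runs that follow it.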