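Introduction
Office Open XML, also known as OpenXML or OOXML, is an XML-based format for office documents. It covers Word documents, Excel spreadsheets, and PowerPoint presentations, as well as charts, diagrams, and shapes.
The specification was developed by Microsoft and adopted by ECMA International as ECMA-376 in 2006. The second edition of the standard was published in December 2008 and the third in June 2011. The specification has also been adopted by ISO and IEC as ISO/IEC 29500.
Note that Office Open XML, OpenOffice.org XML (often written Open Office XML), and OpenDocument Format are distinct, competing XML standards for office software.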
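OOXML defines the following:
- Markup specifications
ECMA-376 defines three main markup languages: WordprocessingML for Word documents, SpreadsheetML for Excel spreadsheets, and PresentationML for PowerPoint presentations. It also includes several supporting markup languages, the most important being DrawingML, which is used for drawings, shapes, and charts.
- File packaging specification
ECMA-376 specifies the Open Packaging Conventions (OPC). OPC is a container-file technology that uses the common ZIP format to combine files into a single package. Splitting the data into separate parts this way makes data access easier and faster and reduces the chance of data corruption.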
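This article uses Python to manipulate Open XML.
The commonly used documentation is listed in the reference list at the end of this article.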
Installation
pip install lxml
pip install python-docx
pip install python-pptx
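XML viewer
Notepad++ is recommended.
Download the XMLTools plugin: create a folder named XMLTools inside the Notepad++ plugins folder and put the downloaded XMLTools.dll in it.
Download the Compare plugin.
Common functions:
- Pretty print: Ctrl+Alt+Shift+B
- Compare: Ctrl+Alt+C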
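Basic principles
- All document content is expressed in Office Open XML
- The data is split into parts and packaged in the ZIP format
- Read the XML of the element of interest
- Modify the underlying XML by adding or changing nodes
Take any Word document, change its extension from .docx to .zip, and open it: it is really just multiple .xml files packed into a .zip archive.
In addition, when using Save As in Word or PowerPoint you can choose the .xml format, which gathers all the XML into a single file.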
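To make the "ZIP archive of XML parts" idea concrete, here is a minimal sketch that uses only the standard library. The part names mirror those of a real package, but the XML bodies are hypothetical placeholders, so the result illustrates the container layout only and is not a valid Word document.

```python
import io
import zipfile

# Hypothetical, heavily simplified parts; a real .docx contains more files
# and far richer XML.
PARTS = {
    "[Content_Types].xml": "<Types/>",
    "_rels/.rels": "<Relationships/>",
    "word/document.xml": "<document><body><p>Hello World!</p></body></document>",
}


def build_package(parts):
    """Pack the given {part name: XML text} mapping into ZIP bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name, xml in parts.items():
            package.writestr(name, xml)
    return buffer.getvalue()


def list_parts(package_bytes):
    """Return the part names stored in the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


package_bytes = build_package(PARTS)
part_names = list_parts(package_bytes)
print(part_names)
```

Because the container is an ordinary ZIP file, any ZIP tool can list or extract the parts; the Office-specific knowledge lies entirely in the XML inside them.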
First attempt
Task: insert a table into a Word document and make a picture fill its cell completely.
The image used is test.jpg.
Code
from docx import Document
from docx.shared import Cm

document = Document()
table = document.add_table(2, 2, style='Table Grid')
cell1, cell2 = table.rows[0].cells
cell1.merge(cell2)  # merge the two cells of the first row
cell1.text = 'Sample image'
table.rows[1].height = Cm(5.0)  # set the height of the second row to 5 cm
detail_cell = table.cell(1, 0)  # cell holding the picture description
detail_cell.text = 'This is a Border Collie'
picture_cell = table.cell(1, 1)  # cell holding the picture
paragraph = picture_cell.paragraphs[0]
run = paragraph.add_run()
picture = run.add_picture('test.jpg', height=Cm(5.0))  # insert the picture
picture_cell.width = Cm(picture.width.cm)  # set the cell width to the picture width
for cell in table.column_cells(1):  # cells in the second column
    cell.width = Cm(5.0)  # give every cell in the second column the same width
document.save('test.docx')
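Result
The cell width has been set to match the image width, yet there is still white space on the left and right.
The cause is that the cell's left and right margins are not 0:
The API documentation for docx.table._Cell has no method for setting cell margins, so the underlying XML has to be modified.
Open the saved test.docx and save it as test.xml.
Manually change the cell margins of the picture cell, setting the left and right margins to 0 cm.
(Note: these are the cell margins, not the table margins.)
Save the result as test1.xml and compare the two files.
The main difference in the XML is as follows: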
<w:tc>
  <w:tcPr>
    <w:tcMar>
      <w:left w:w="0" w:type="dxa"/>
      <w:right w:w="0" w:type="dxa"/>
    </w:tcMar>
  </w:tcPr>
</w:tc>
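All that is needed is to add the required XML nodes under the corresponding node.
Use xxx._element.xml to view the XML:
print(picture_cell._element.xml)
Everything created with OxmlElement inherits from etree.ElementBase, so the methods of lxml.etree.ElementBase can be called on it. The most commonly used are:
- xpath(xpath_str): find elements by XPath; returns a list
- append(element): append a child element
- find(path, namespaces=None): find a single element
- findall(path, namespaces=None): find all matching elements
- findtext(path, default=None, namespaces=None): return the text of the first matching element
- get(key, default=None): get an attribute value
- set(key, value): set an attribute value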
from docx.oxml.ns import nsmap
Pass nsmap as the namespaces argument of these methods.
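As an illustration of these calls, the following sketch uses the standard library's xml.etree.ElementTree as a stand-in, since its find, findall, findtext, get, set, and append take the same arguments as the lxml methods named above (xpath() is lxml-only and is not shown). NSMAP, the w() helper, and the sample table XML are hypothetical and exist only for this example; with python-docx you would pass its nsmap instead.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NSMAP = {"w": W_NS}  # hypothetical stand-in for docx.oxml.ns.nsmap


def w(name):
    """Expand a local name into the fully qualified WordprocessingML name."""
    return "{%s}%s" % (W_NS, name)


# A tiny, made-up table with two cells.
TABLE_XML = (
    '<w:tbl xmlns:w="%s"><w:tr>'
    '<w:tc><w:tcPr><w:tcW w:w="2835" w:type="dxa"/></w:tcPr>'
    '<w:p><w:r><w:t>left</w:t></w:r></w:p></w:tc>'
    '<w:tc><w:tcPr><w:tcW w:w="2835" w:type="dxa"/></w:tcPr>'
    '<w:p><w:r><w:t>right</w:t></w:r></w:p></w:tc>'
    '</w:tr></w:tbl>' % W_NS
)

root = ET.fromstring(TABLE_XML)

cells = root.findall(".//w:tc", namespaces=NSMAP)           # every cell
last_tcPr = cells[-1].find("w:tcPr", namespaces=NSMAP)      # one child element
first_text = root.findtext(".//w:t", default="", namespaces=NSMAP)

tcW = last_tcPr.find("w:tcW", namespaces=NSMAP)
old_width = tcW.get(w("w"))   # read an attribute
tcW.set(w("w"), "1440")       # overwrite it

tcMar = ET.Element(w("tcMar"))
last_tcPr.append(tcMar)       # append a new child node

print(len(cells), first_text, old_width, tcW.get(w("w")))
```

Note that prefixed paths such as `w:tcPr` only work when a namespace map is supplied, while attribute names passed to get() and set() must already be fully qualified.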
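Because XML uses namespaces, some errors are resolved by passing a namespace map or by qualifying the name.
For SyntaxError: prefix 'w' not found in prefix map, specify namespaces.
For ValueError: Invalid attribute name 'w:w', wrap the name with qn(); the result is actually {http://schemas.openxmlformats.org/wordprocessingml/2006/main}w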
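The two fixes can be sketched with the standard library alone. The qn() below is a hypothetical re-implementation of the idea behind python-docx's helper, and NSMAP is a one-entry stand-in for nsmap. The standard library raises the same SyntaxError for an unmapped prefix in a path expression; the ValueError for an unqualified attribute name is specific to lxml and is therefore only avoided here, not reproduced.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-entry stand-in for docx.oxml.ns.nsmap.
NSMAP = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}


def qn(tag):
    """Turn a prefixed name such as 'w:tcPr' into '{namespace-uri}tcPr'."""
    prefix, local = tag.split(":")
    return "{%s}%s" % (NSMAP[prefix], local)


tc = ET.Element(qn("w:tc"))
tcPr = ET.SubElement(tc, qn("w:tcPr"))

# Without a namespace map the prefix in the path cannot be resolved.
lookup_error = None
try:
    tc.find(".//w:tcPr")
except SyntaxError as error:
    lookup_error = str(error)

# With the map the same path resolves.
found = tc.find(".//w:tcPr", namespaces=NSMAP)

# Attribute names are always given in fully qualified form.
left = ET.SubElement(tcPr, qn("w:left"))
left.set(qn("w:w"), "0")

print(lookup_error)
print(qn("w:w"))
```

The qualified form with the URI in braces is what the parser stores internally; prefixes such as `w:` exist only in the serialized text.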
Create the XML elements, set their attributes, and append them to the parent node one by one:
from docx import Document
from docx.shared import Cm
from docx.oxml.ns import nsmap
from docx.oxml.shared import OxmlElement, qn

document = Document()
table = document.add_table(2, 2, style='Table Grid')
cell1, cell2 = table.rows[0].cells
cell1.merge(cell2)  # merge the two cells of the first row
cell1.text = 'Sample image'
table.rows[1].height = Cm(5.0)  # set the height of the second row to 5 cm
detail_cell = table.cell(1, 0)  # cell holding the picture description
detail_cell.text = 'This is a Border Collie'
picture_cell = table.cell(1, 1)  # cell holding the picture
paragraph = picture_cell.paragraphs[0]
run = paragraph.add_run()
picture = run.add_picture('test.jpg', height=Cm(5.0))  # insert the picture
picture_cell.width = Cm(picture.width.cm)  # set the cell width to the picture width
for cell in table.column_cells(1):  # cells in the second column
    cell.width = Cm(5.0)  # give every cell in the second column the same width
print(picture_cell._element.xml)
# Alternative lookup: picture_cell._element.find('w:tcPr', namespaces=nsmap)
tcPr = picture_cell._element.xpath('w:tcPr')[0]
tcMar = OxmlElement('w:tcMar')
left = OxmlElement('w:left')
right = OxmlElement('w:right')
left.set(qn('w:w'), str(0))  # left margin of 0
left.set(qn('w:type'), 'dxa')
right.set(qn('w:w'), str(0))  # right margin of 0
right.set(qn('w:type'), 'dxa')
tcMar.append(left)
tcMar.append(right)
tcPr.append(tcMar)
document.save('test.docx')
Result
Getting the XML
Code
from lxml import etree
from zipfile import ZipFile

with ZipFile('test.docx') as package:  # a .docx file is a ZIP archive
    package.printdir()  # list the files inside the package
    print()
    xml = package.read('word/document.xml')
root = etree.fromstring(xml)  # build the element tree
print(etree.tostring(root, encoding='unicode', pretty_print=True))  # pretty-print the XML
Output
File Name Modified Size
[Content_Types].xml 1980-01-01 00:00:00 1312
_rels/.rels 1980-01-01 00:00:00 590
word/_rels/document.xml.rels 1980-01-01 00:00:00 817
word/document.xml 1980-01-01 00:00:00 1702
word/theme/theme1.xml 1980-01-01 00:00:00 6796
word/settings.xml 1980-01-01 00:00:00 2877
word/fontTable.xml 1980-01-01 00:00:00 1552
word/webSettings.xml 1980-01-01 00:00:00 497
docProps/app.xml 1980-01-01 00:00:00 711
docProps/core.xml 1980-01-01 00:00:00 739
word/styles.xml 1980-01-01 00:00:00 28638
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14">
  <w:body>
    <w:p w:rsidR="008042DE" w:rsidRDefault="00EB3B68">
      <w:r>
        <w:rPr>
          <w:rFonts w:hint="eastAsia"/>
        </w:rPr>
        <w:t>H</w:t>
      </w:r>
      <w:r>
        <w:t>ello World!</w:t>
      </w:r>
      <w:bookmarkStart w:id="0" w:name="_GoBack"/>
      <w:bookmarkEnd w:id="0"/>
    </w:p>
    <w:sectPr w:rsidR="008042DE">
      <w:pgSz w:w="11906" w:h="16838"/>
      <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="851" w:footer="992" w:gutter="0"/>
      <w:cols w:space="425"/>
      <w:docGrid w:type="lines" w:linePitch="312"/>
    </w:sectPr>
  </w:body>
</w:document>
Modifying the XML
Modifying the XML and saving it involves the following modules:
- os, shutil, zipfile, tempfile: file operations
- lxml: modifying the XML
from lxml import etree
from zipfile import ZipFile
from docx.oxml.ns import nsmap

def qn(tag):
    """Build a namespace-qualified name from a prefixed tag such as 'w:tcMar'."""
    prefix, tagroot = tag.split(':')
    uri = nsmap[prefix]
    return '{%s}%s' % (uri, tagroot)

# Read the XML
zipread = ZipFile('test.docx')  # open the .docx file
xml_name = 'word/document.xml'  # the part to modify
xml = zipread.read(xml_name)

# Modify the XML
root = etree.fromstring(xml)
tcPr = root.xpath('.//w:tcPr', namespaces=nsmap)  # find the cell properties
tcPr = tcPr[-1]  # properties of the last cell
tcMar = etree.SubElement(tcPr, qn('w:tcMar'))
left = etree.SubElement(tcMar, qn('w:left'))
right = etree.SubElement(tcMar, qn('w:right'))
left.set(qn('w:w'), str(0))
left.set(qn('w:type'), 'dxa')
right.set(qn('w:w'), str(0))
right.set(qn('w:type'), 'dxa')
# print(etree.tostring(root, encoding='unicode', pretty_print=True))  # show the result

# Write the file; serialize to bytes with an XML declaration
newdata = etree.tostring(root, xml_declaration=True, encoding='UTF-8', standalone=True)
zipwrite = ZipFile('test1.docx', 'w')
for item in zipread.infolist():
    if item.filename != xml_name:  # copy the untouched files as they are
        data = zipread.read(item.filename)
        zipwrite.writestr(item.filename, data)
zipwrite.writestr(xml_name, newdata)
zipread.close()
zipwrite.close()
Result
Recommended reading:
- Using python to edit xml values (overwrites the original file using os, shutil, and tempfile)
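The overwrite-in-place approach mentioned in that reading can be sketched roughly as follows. This is one possible arrangement under stated assumptions, not the code from the linked answer: replace_part() is a hypothetical helper name, and the demo package contains placeholder XML rather than a real document.

```python
import os
import shutil
import tempfile
import zipfile


def replace_part(package_path, part_name, new_data):
    """Rewrite one part of a ZIP package, then overwrite the original file."""
    directory = os.path.dirname(os.path.abspath(package_path))
    # Build the new package next to the original so the final move
    # stays on the same filesystem.
    handle, temp_path = tempfile.mkstemp(suffix=".zip", dir=directory)
    os.close(handle)
    try:
        with zipfile.ZipFile(package_path) as source, \
                zipfile.ZipFile(temp_path, "w", zipfile.ZIP_DEFLATED) as target:
            for item in source.infolist():
                if item.filename == part_name:
                    data = new_data                      # the modified part
                else:
                    data = source.read(item.filename)    # copied unchanged
                target.writestr(item.filename, data)
        shutil.move(temp_path, package_path)
    finally:
        # Clean up if something failed before the move.
        if os.path.exists(temp_path):
            os.remove(temp_path)


# Demo on a throwaway package with placeholder content.
with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "demo.docx")
    with zipfile.ZipFile(demo_path, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document>old</document>")
    replace_part(demo_path, "word/document.xml", "<document>new</document>")
    with zipfile.ZipFile(demo_path) as package:
        updated_xml = package.read("word/document.xml").decode("utf-8")
        remaining_names = package.namelist()

print(updated_xml)
```

Writing to a temporary file first means the original document is only replaced once the new package has been written completely.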
Word-related
Setting cell margins
from docx import Document
from docx.shared import Cm
from docx.oxml.shared import OxmlElement, qn

def set_cell_margins(cell, **kwargs):
    """Set the margins of a single cell.

    Lengths are in twips: 1 twip = 1/20 pt, roughly 1/567 cm.

    >>> set_cell_margins(table.cell(1, 0), top=0, start=0, bottom=0, end=0, left=0, right=0)

    :param cell: the cell to modify
    :param top: top margin
    :param start: left margin
    :param bottom: bottom margin
    :param end: right margin
    :param left: left margin (WPS)
    :param right: right margin (WPS)
    """
    tc = cell._tc
    tcPr = tc.get_or_add_tcPr()
    tcMar = OxmlElement('w:tcMar')
    for m in ['top', 'start', 'bottom', 'end', 'left', 'right']:
        if m in kwargs:
            node = OxmlElement('w:{}'.format(m))
            node.set(qn('w:w'), str(kwargs.get(m)))
            node.set(qn('w:type'), 'dxa')
            tcMar.append(node)
    tcPr.append(tcMar)

document = Document()
table = document.add_table(2, 2, style='Light List')
heading_cells = table.rows[0].cells
cell1, cell2 = heading_cells[0], heading_cells[-1]
cell1.merge(cell2)  # merge the two cells of the first row
cell1.text = 'Sample image'
table.rows[1].height = Cm(5.0)  # set the height of the second row to 5 cm
detail_cell = table.cell(1, 0)  # cell describing the picture
picture_cell = table.cell(1, 1)  # cell holding the picture
detail_cell.text = 'This is a Border Collie'
paragraph = picture_cell.paragraphs[0]
run = paragraph.add_run()
picture = run.add_picture('test.jpg', height=Cm(5.0))  # insert the picture
picture_cell.width = Cm(picture.width.cm)  # set the cell width to the picture width
min_width = min([cell.width.cm for cell in table.column_cells(1)])  # smallest cell width in the second column
for cell in table.column_cells(1):  # cells in the second column
    cell.width = Cm(min_width)  # set every cell in the second column to the smallest width
set_cell_margins(picture_cell, top=0, start=0, bottom=0, end=0, left=0, right=0)  # set the cell margins
document.save('test.docx')
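Since set_cell_margins() expects twips (the dxa unit), a small conversion helper can be convenient when margins are known in centimetres or points. The sketch below relies only on the definitions 1 inch = 72 pt = 2.54 cm and 1 pt = 20 twips; the function names are hypothetical and not part of python-docx.

```python
TWIPS_PER_POINT = 20
POINTS_PER_INCH = 72
CM_PER_INCH = 2.54


def pt_to_twips(points):
    """Convert points to twips, rounded to the nearest whole twip."""
    return round(points * TWIPS_PER_POINT)


def cm_to_twips(cm):
    """Convert centimetres to twips, rounded to the nearest whole twip."""
    return round(cm / CM_PER_INCH * POINTS_PER_INCH * TWIPS_PER_POINT)


print(cm_to_twips(1))    # one centimetre
print(cm_to_twips(5))    # the 5 cm column width used in this article
print(pt_to_twips(1))    # one point
```

A 5 cm width converts to 2835 twips, which matches the `w:tcW w:w="2835"` value that python-docx writes for the 5 cm cells in this article.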
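PowerPoint-related
Encapsulation
opc-diag
opc-diag can compare the XML of two documents.
Installation
pip install opc-diag
First attempt
opc diff before.docx after.docx
This is equivalent to the following (the function writes directly to stdout):
from opcdiag.controller import OpcController
OpcController().diff_pkg('before.docx', 'after.docx')
Or, to capture and format the output:
import ast
import subprocess
process = subprocess.run('opc diff before.docx after.docx', shell=True, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT)
data = process.stdout.decode()
# Assumption: the command prints the diff as a single bytes literal (b'...');
# parse that literal safely instead of passing it to eval()
print(ast.literal_eval(data.strip()).decode())
Result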
--- before/word/document.xml
+++ after/word/document.xml
@@ -60,6 +60,10 @@
<w:tc>
<w:tcPr>
<w:tcW w:type="dxa" w:w="2835"/>
+ <w:tcMar>
+ <w:left w:w="0" w:type="dxa"/>
+ <w:right w:w="0" w:type="dxa"/>
+ </w:tcMar>
</w:tcPr>
<w:p>
<w:r>
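For readers without opc-diag installed, a comparable diff of a single XML part can be approximated with the standard library. This is a rough stand-in and not how opc-diag is implemented: pretty_lines() and diff_xml() are hypothetical helpers, and the two XML snippets are cut-down samples of the cell shown above.

```python
import difflib
import xml.dom.minidom

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Cut-down sample parts: the same cell before and after adding margins.
BEFORE = (
    '<w:tc xmlns:w="%s"><w:tcPr><w:tcW w:w="2835" w:type="dxa"/>'
    '</w:tcPr></w:tc>' % W_NS
)
AFTER = (
    '<w:tc xmlns:w="%s"><w:tcPr><w:tcW w:w="2835" w:type="dxa"/>'
    '<w:tcMar><w:left w:w="0" w:type="dxa"/><w:right w:w="0" w:type="dxa"/>'
    '</w:tcMar></w:tcPr></w:tc>' % W_NS
)


def pretty_lines(xml_text):
    """Pretty-print XML so each element sits on its own line."""
    pretty = xml.dom.minidom.parseString(xml_text).toprettyxml(indent="  ")
    return [line for line in pretty.splitlines() if line.strip()]


def diff_xml(before_xml, after_xml, part_name="word/document.xml"):
    """Return a unified diff of two versions of one XML part."""
    return list(difflib.unified_diff(
        pretty_lines(before_xml),
        pretty_lines(after_xml),
        fromfile="before/" + part_name,
        tofile="after/" + part_name,
        lineterm="",
    ))


diff_lines = diff_xml(BEFORE, AFTER)
print("\n".join(diff_lines))
```

Pretty-printing before diffing matters because the parts of a package are usually stored on a single line, which would otherwise make every change look like a whole-file difference.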
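XML compression
Miscellaneous
1. Switching Word to an English interface
During development, some field names need to be matched one-to-one against the English version.
File → Options → Language → select [English] → Set as Default → restart Word
If that option is not available, download the Office 2013 English language pack.
2. Good-looking charts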
from pptx.util import Cm, Pt
from pptx import Presentation
from pptx.dml.color import RGBColor
from pptx.chart.data import ChartData
from pptx.oxml.xmlchemy import OxmlElement
from pptx.enum.chart import XL_CHART_TYPE, XL_LEGEND_POSITION, XL_DATA_LABEL_POSITION
def set_chart_value_axis_no_fill(chart):
    """Set the vertical (value) axis to no line."""
    element = chart.element
    valAx = element.valAx_lst[0]
    node = OxmlElement('c:spPr')
    node1 = OxmlElement('a:ln')
    node2 = OxmlElement('a:noFill')
    node1.append(node2)
    node.append(node1)
    valAx.append(node)

def set_chart_category_axis_no_fill(chart):
    """Set the horizontal (category) axis to no line."""
    element = chart.element
    catAx = element.catAx_lst[0]
    node = OxmlElement('c:spPr')
    node1 = OxmlElement('a:ln')
    node2 = OxmlElement('a:noFill')
    node1.append(node2)
    node.append(node1)
    catAx.append(node)

def set_chart_right_angle_corner(chart):
    """Give the chart border square corners."""
    element = chart.element
    node = OxmlElement('c:roundedCorners')
    node.set('val', str(0))
    element.append(node)

def set_chart_spPr_fill_color(chart, hex='D9D9D9'):
    """Set the color of the chart border."""
    element = chart.element
    spPr = OxmlElement('c:spPr')
    ln = OxmlElement('a:ln')
    solidFill = OxmlElement('a:solidFill')
    srgbClr = OxmlElement('a:srgbClr')
    srgbClr.set('val', hex)
    solidFill.append(srgbClr)
    ln.append(solidFill)
    spPr.append(ln)
    element.append(spPr)

def set_chart_overlap(chart, val=-27):
    """Set the series overlap of the chart."""
    element = chart.element
    plotArea = element.plotArea
    barChart = plotArea.xpath('c:barChart')
    barChart = barChart[0]
    overlap = OxmlElement('c:overlap')
    overlap.set('val', str(val))
    barChart.append(overlap)

def set_chart_gapWidth(chart, val=219):
    """Set the gap width between categories of the chart."""
    element = chart.element
    plotArea = element.plotArea
    barChart = plotArea.xpath('c:barChart')
    barChart = barChart[0]
    gapWidth = OxmlElement('c:gapWidth')
    gapWidth.set('val', str(val))
    barChart.append(gapWidth)
presentation = Presentation()
title_only_slide = presentation.slide_layouts[5]
slide = presentation.slides.add_slide(title_only_slide)
shapes = slide.shapes
shapes.title.text = ' '

# Chart logic
chart_data = ChartData()
chart_data.categories = ['Category 1', 'Category 2', 'Category 3', 'Category 4']
chart_data.add_series('Series 1', (4.3, 2.5, 3.5, 4.5))
chart_data.add_series('Series 2', (2.4, 4.4, 1.8, 2.8))
chart_data.add_series('Series 3', (2, 2, 3, 5))
x, y, cx, cy = Cm(4), Cm(4), Cm(17), Cm(11)
graphic_frame = shapes.add_chart(XL_CHART_TYPE.COLUMN_CLUSTERED, x, y, cx, cy, chart_data)
chart = graphic_frame.chart
chart.has_legend = True  # show the legend
chart.legend.position = XL_LEGEND_POSITION.BOTTOM  # place the legend at the bottom
chart.legend.include_in_layout = False  # the legend does not overlap the chart
chart.legend.font.size = Pt(9)  # legend font size
chart.legend.font.color.rgb = RGBColor(89, 89, 89)  # legend font color
plot = chart.plots[0]
plot.has_data_labels = True  # show data labels
data_labels = plot.data_labels
data_labels.font.size = Pt(9)  # data label font size
data_labels.position = XL_DATA_LABEL_POSITION.OUTSIDE_END  # place data labels at the outside end
for series in chart.series:
    series.data_labels.show_value = True
    series.data_labels.font.size = Pt(9)  # font size of the series data labels
chart.has_title = True  # show the title
chart_title = chart.chart_title
text_frame = chart_title.text_frame
text_frame.text = 'Chart Title'
paragraphs = text_frame.paragraphs  # paragraphs of the title
paragraph = paragraphs[0]
paragraph.font.size = Pt(18.6)  # title font size
paragraph.font.color.rgb = RGBColor(89, 89, 89)  # title font color
category_axis = chart.category_axis  # horizontal (category) axis
category_axis.tick_labels.font.size = Pt(12)  # horizontal axis font size
category_axis.tick_labels.font.color.rgb = RGBColor(89, 89, 89)  # horizontal axis font color
value_axis = chart.value_axis  # vertical (value) axis
value_axis.tick_labels.font.size = Pt(12)  # vertical axis font size
value_axis.tick_labels.font.color.rgb = RGBColor(89, 89, 89)  # vertical axis font color
value_axis.major_gridlines.format.line.color.rgb = RGBColor(217, 217, 217)  # color of the vertical axis major gridlines
set_chart_value_axis_no_fill(chart)  # vertical axis: no line
set_chart_category_axis_no_fill(chart)  # horizontal axis: no line
set_chart_right_angle_corner(chart)  # square corners for the chart border
set_chart_spPr_fill_color(chart)  # color of the chart border
set_chart_overlap(chart, -27)  # series overlap
set_chart_gapWidth(chart, 219)  # gap width between categories
presentation.save('test.pptx')
Result
References
- Office Open XML
- OpenXml.Wordprocessing documentation
- OpenXml.Spreadsheet documentation
- OpenXml.Presentation documentation
- python-docx Documentation
- python-pptx Documentation
- lxml Documentation
- opc-diag Documentation
- How to set cell margins of tables in ms word using python docx
- Word文件的OpenXML解析(以Python3为例) (Parsing the OpenXML of Word files, using Python 3 as an example)
- xml - How can I save an edited Word document with Python?
- SyntaxError: prefix ‘a’ not found in prefix map
- Using python to edit xml values
- How to set the Value Axis of a chart to No line
- Python docx paragraph in textbox