Inserting comments into docx files with Python

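My project needs to insert comments into docx files automatically, and Python was the first choice. Python has the python-docx library, but at the time of writing it does not support inserting comments. Someone raised this question in the python-docx project, and the author, scanny, gave some guidance there.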

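The rough idea is as follows. Unzipping a docx file yields a set of files and folders. Comparing the unzipped contents of a document with a comment against one without shows that inserting the first comment adds a new file, word/comments.xml, and modifies word/_rels/document.xml.rels and word/document.xml. Inserting further comments only modifies word/comments.xml and word/document.xml. So once it is clear how document.xml.rels, comments.xml and document.xml change, comment insertion can be automated.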

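You can try this by hand: rename the docx file to .zip, unzip it, edit the files inside, zip them back up, and rename the result to .docx. For problems that can come up when zipping the files back up, refer to the link here.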
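The same rename, unzip, edit and re-zip cycle can be done with Python's zipfile module, without renaming anything. The following is a minimal sketch, assuming the edited parts are already available as bytes; `repack_docx` and its parameters are hypothetical names chosen for illustration.

```python
import zipfile


def repack_docx(src, dst, replacements):
    """Copy the package at `src` to `dst`, swapping in new bytes for some parts.

    `src` and `dst` are paths or file-like objects. `replacements` maps part
    names such as "word/document.xml" to bytes. Parts that do not exist in
    the source package are appended at the end.
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        seen = set()
        for info in zin.infolist():
            # Keep the original order of the parts; only the content changes.
            data = replacements.get(info.filename, zin.read(info.filename))
            zout.writestr(info.filename, data)
            seen.add(info.filename)
        for name, data in replacements.items():
            if name not in seen:
                zout.writestr(name, data)
```

Writing to a separate `dst` keeps the original file intact until the new package has been checked.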

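File structure after unzipping a document without a comment:

File structure after unzipping a document with a comment:

The most obvious difference is the new word/comments.xml file. Beyond that, the contents of word/_rels/document.xml.rels and word/document.xml also change.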
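Instead of comparing the two unzipped folders by eye, the comparison can be scripted. This is a small sketch, assuming two copies of the same document saved before and after adding a comment; `diff_packages` is a hypothetical helper name.

```python
import zipfile


def diff_packages(before, after):
    """Return (added, removed, changed) part names between two docx packages."""
    with zipfile.ZipFile(before) as a, zipfile.ZipFile(after) as b:
        names_a, names_b = set(a.namelist()), set(b.namelist())
        added = sorted(names_b - names_a)
        removed = sorted(names_a - names_b)
        # A part counts as changed when its uncompressed bytes differ.
        changed = sorted(n for n in names_a & names_b if a.read(n) != b.read(n))
    return added, removed, changed
```

Running it on a real pair of files may list more parts than the three discussed here, since an editor can rewrite other parts when it saves.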

First, compare the contents of word/_rels/document.xml.rels.

Before inserting a comment:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"><Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/><Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml" Target="../customXml/item1.xml"/><Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/><Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/><Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/></Relationships>

After inserting a comment:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"><Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/><Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml" Target="../customXml/item1.xml"/><Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/><Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments" Target="comments.xml"/><Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/><Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/></Relationships>
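In the sample above the editor renumbered the existing relationships so that the comments entry became rId3. As far as I can tell, only uniqueness of the Id matters for this entry, so appending a new entry with an unused Id looks like the safer option, because other relationships may be referenced by Id from document.xml. Below is a sketch using ElementTree, assuming Python 3.8 or later; `ensure_comments_relationship` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

PKG_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
COMMENTS_TYPE = ("http://schemas.openxmlformats.org/officeDocument/2006/"
                 "relationships/comments")


def ensure_comments_relationship(rels_bytes):
    """Return (rels_bytes, rId) with a comments relationship present."""
    # Serialize with a default namespace instead of an ns0: prefix.
    ET.register_namespace("", PKG_NS)
    root = ET.fromstring(rels_bytes)
    rels = root.findall("{%s}Relationship" % PKG_NS)
    for rel in rels:
        if rel.get("Type") == COMMENTS_TYPE:
            return rels_bytes, rel.get("Id")  # already there, nothing to do
    used = {rel.get("Id") for rel in rels}
    n = 1
    while "rId{}".format(n) in used:  # first rIdN that is not taken
        n += 1
    new_id = "rId{}".format(n)
    ET.SubElement(root, "{%s}Relationship" % PKG_NS,
                  {"Id": new_id, "Type": COMMENTS_TYPE, "Target": "comments.xml"})
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True), new_id
```

Note that ElementTree writes its own XML declaration, without the standalone="yes" attribute seen in the sample files.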

Next, compare the contents of word/document.xml.

Before:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:wpsCustomData="http://www.wps.cn/officeDocument/2013/wpsCustomData" mc:Ignorable="w14 w15 wp14"><w:body><w:p><w:r><w:t>这是一段文本,等待插入批注</w:t></w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p><w:sectPr><w:pgSz w:w="11906" w:h="16838"/><w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="851" w:footer="992" w:gutter="0"/><w:cols w:space="425" w:num="1"/><w:docGrid w:type="lines" w:linePitch="312" w:charSpace="0"/></w:sectPr></w:body></w:document>

After:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:wpsCustomData="http://www.wps.cn/officeDocument/2013/wpsCustomData" mc:Ignorable="w14 w15 wp14"><w:body><w:p><w:r><w:t>这是一段</w:t></w:r><w:commentRangeStart w:id="0"/><w:r><w:t>文本</w:t></w:r><w:commentRangeEnd w:id="0"/><w:r><w:commentReference w:id="0"/></w:r><w:r><w:t>,等待插入批注</w:t></w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p><w:sectPr><w:pgSz w:w="11906" w:h="16838"/><w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="851" w:footer="992" w:gutter="0"/><w:cols w:space="425" w:num="1"/><w:docGrid w:type="lines" w:linePitch="312" w:charSpace="0"/></w:sectPr></w:body></w:document>
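The change in document.xml follows a fixed pattern: the run containing the commented text is split, the commented part is enclosed by w:commentRangeStart and w:commentRangeEnd, and a run holding w:commentReference follows, all with the same w:id. The sketch below rebuilds that pattern for a single plain-text paragraph. It assumes the paragraph has no formatting, and `text_run` and `commented_paragraph` are hypothetical helper names.

```python
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def text_run(text):
    """One <w:r> holding `text`, XML-escaped, with whitespace preserved."""
    return '<w:r><w:t xml:space="preserve">{}</w:t></w:r>'.format(escape(text))


def commented_paragraph(text, start, end, comment_id):
    """Build a <w:p> in which text[start:end] carries comment `comment_id`."""
    parts = ['<w:p xmlns:w="{}">'.format(W_NS)]
    if text[:start]:
        parts.append(text_run(text[:start]))
    parts.append('<w:commentRangeStart w:id="{}"/>'.format(comment_id))
    parts.append(text_run(text[start:end]))
    parts.append('<w:commentRangeEnd w:id="{}"/>'.format(comment_id))
    # The reference run is where the editor attaches the comment balloon.
    parts.append('<w:r><w:commentReference w:id="{}"/></w:r>'.format(comment_id))
    if text[end:]:
        parts.append(text_run(text[end:]))
    parts.append('</w:p>')
    return ''.join(parts)
```

For example, `commented_paragraph("This is some text to comment", 13, 17, 0)` marks the word "text", which mirrors the structure of the sample paragraph above.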

Finally, compare word/comments.xml after inserting one comment and after inserting two.

With one comment:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:comments xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:wpsCustomData="http://www.wps.cn/officeDocument/2013/wpsCustomData" mc:Ignorable="w14 w15 wp14"><w:comment w:id="0" w:author="guochuanxiang" w:date="2019-03-14T14:46:32Z" w:initials="g"><w:p><w:pPr><w:pStyle w:val="2"/></w:pPr><w:r><w:t>这是一个批注</w:t></w:r></w:p></w:comment></w:comments>

With two comments:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:comments xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:wpsCustomData="http://www.wps.cn/officeDocument/2013/wpsCustomData" mc:Ignorable="w14 w15 wp14"><w:comment w:id="0" w:author="guochuanxiang" w:date="2019-03-14T14:46:32Z" w:initials="g"><w:p><w:pPr><w:pStyle w:val="2"/></w:pPr><w:r><w:t>这是一个批注</w:t></w:r></w:p></w:comment><w:comment w:id="1" w:author="guochuanxiang" w:date="2019-03-14T14:52:47Z" w:initials="g"><w:p><w:pPr><w:pStyle w:val="2"/></w:pPr><w:r><w:t>这是第二个批注</w:t></w:r></w:p></w:comment></w:comments>

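You can explore the differences yourself. In short, each comment is one w:comment element, and its w:id matches the w:id of the commentRangeStart, commentRangeEnd and commentReference elements in word/document.xml.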
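To check what a comments.xml actually contains, it can be parsed rather than read by eye. This is a sketch with ElementTree, assuming the bytes of word/comments.xml have been read from the package; `list_comments` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def list_comments(comments_bytes):
    """Return a list of (id, author, text) tuples, one per <w:comment>."""
    root = ET.fromstring(comments_bytes)
    result = []
    for comment in root.iter(W + "comment"):
        # A comment body may span several runs; join all of their text.
        text = "".join(t.text or "" for t in comment.iter(W + "t"))
        result.append((int(comment.get(W + "id")), comment.get(W + "author"), text))
    return result
```

Because the ids come back as integers, the next free id is simply the largest one plus 1, or 0 when the list is empty.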

Without further ado, here is the implementation.

Run it as: python3 [script file name] [path to the docx file] [text to be commented] [comment text]

Example: python3 insert_comments.py /Users/guochuanxiang/Desktop/comments.docx 文本 批注

# coding:utf-8

import re
import sys
from datetime import datetime, timezone
from xml.sax.saxutils import escape
from zipfile import ZIP_DEFLATED, ZipFile

COMMENTS_REL_TYPE = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments'

# An empty comments.xml with the same root element as the sample shown above
EMPTY_COMMENTS = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n<w:comments xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:wpsCustomData="http://www.wps.cn/officeDocument/2013/wpsCustomData" mc:Ignorable="w14 w15 wp14"></w:comments>'


def write_comments(comments_file_content, comments):  # comments: [commented text, comment text, comment id]
    comments_id = comments[2]
    print('generate comments.xml content....')
    date = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
    # The comment text is XML-escaped so that '&' or '<' cannot break the file.
    # pStyle "2" is the style id used in the sample document above.
    tmp = '<w:comment w:id="{}" w:author="guochuanxiang" w:date="{}" w:initials="g"><w:p><w:pPr><w:pStyle w:val="2"/></w:pPr><w:r><w:t>{}</w:t></w:r></w:p></w:comment>'.format(comments_id, date, escape(comments[1]))
    # Insert before the closing tag instead of slicing off a fixed number of
    # characters, which breaks as soon as the file ends with a newline.
    head, closing, tail = comments_file_content.rpartition('</w:comments>')
    if not closing:
        raise ValueError('no </w:comments> closing tag found')
    return head + tmp + closing + tail


def write_document(document_file_content, comments):
    comments_id = comments[2]
    print('generate document.xml content....')
    target = escape(comments[0])  # text is stored XML-escaped inside document.xml
    tmp = '</w:t></w:r><w:commentRangeStart w:id="{0}"/><w:r><w:rPr><w:rFonts w:hint="eastAsia"/></w:rPr><w:t>{1}</w:t></w:r><w:commentRangeEnd w:id="{0}"/><w:r><w:commentReference w:id="{0}"/></w:r><w:r><w:rPr><w:rFonts w:hint="eastAsia"/></w:rPr><w:t>'.format(comments_id, target)
    # Only match the first occurrence inside a <w:t> element, so that a tag or
    # attribute containing the same characters is never rewritten.
    pattern = re.compile(r'(<w:t(?:\s[^>]*)?>[^<]*?)' + re.escape(target))
    content_document, count = pattern.subn(lambda m: m.group(1) + tmp, document_file_content, count=1)
    if count == 0:
        raise ValueError('text to comment not found inside a single run: {}'.format(comments[0]))
    return content_document


def write_rel(rel_file_content, comments):
    if rel_file_content.find('comments.xml') == -1:
        print('not find comments.xml')
        # Pick an id that is not in use; a fixed 'rId9' can collide with an existing one.
        used = [int(n) for n in re.findall(r'Id="rId(\d+)"', rel_file_content)]
        new_id = 'rId{}'.format(max(used, default=0) + 1)
        tmp = '<Relationship Id="{}" Type="{}" Target="comments.xml"/>'.format(new_id, COMMENTS_REL_TYPE)
        head, closing, tail = rel_file_content.rpartition('</Relationships>')
        if not closing:
            raise ValueError('no </Relationships> closing tag found')
        content_rel = head + tmp + closing + tail
        print(content_rel)
        return content_rel
    else:
        print('get comments.xml in rels file')
        return rel_file_content


def run(file_path='/Users/guochuanxiang/Desktop/test.docx', comments=('内容', '批注1')):
    comments = list(comments)  # copy, so the caller's list is never mutated
    print('reading....')
    with ZipFile(file_path) as doc:
        # Read every part into memory; nothing is extracted to disk,
        # so there is nothing to clean up afterwards.
        parts = {name: doc.read(name) for name in doc.namelist()}

    if 'word/comments.xml' not in parts:
        print('create comments.xml')
        comments_file = EMPTY_COMMENTS
        comments.append(0)
    else:
        comments_file = parts['word/comments.xml'].decode('utf-8')
        # Collect the w:id of every <w:comment> and compare them as integers;
        # comparing the strings would rank "9" above "10".
        ids = re.findall(r'<w:comment\b[^>]*?\bw:id="(\d+)"', comments_file)
        comments.append(max((int(i) for i in ids), default=-1) + 1)
    document_file = parts['word/document.xml'].decode('utf-8')
    rel_file = parts['word/_rels/document.xml.rels'].decode('utf-8')

    # Build all three parts first, so a failure leaves the original file untouched.
    parts['word/comments.xml'] = write_comments(comments_file, comments).encode('utf-8')
    parts['word/document.xml'] = write_document(document_file, comments).encode('utf-8')
    parts['word/_rels/document.xml.rels'] = write_rel(rel_file, comments).encode('utf-8')
    print('get all content')

    print('create commented docx....')
    with ZipFile(file_path, mode='w', compression=ZIP_DEFLATED) as new_file:
        for name, data in parts.items():
            print('add {}'.format(name))
            new_file.writestr(name, data)  # write each part back into the docx
    print('done')


if __name__ == '__main__':
    file_path = sys.argv[1]
    text = sys.argv[2]
    comment = sys.argv[3]
    comments = [text, comment]
    print(comments)
    run(file_path, comments)

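Summary: according to scanny, python-docx provides ways to insert content into the XML, but I have not used that module, so I did not look into how to do this with python-docx. The approach here has limitations: if the same piece of text is commented more than once, problems may occur. Using the insertion methods of python-docx might solve that, and interested readers are welcome to try.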
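One more part may be worth checking. Packages written by Word normally also declare the comments part in [Content_Types].xml with an Override entry, and the comparison above does not cover that file. Whether a given editor opens the document without this entry is something I have not verified, so treat the following as a sketch under that assumption; `register_comments_part` is a hypothetical name.

```python
COMMENTS_CONTENT_TYPE = ("application/vnd.openxmlformats-officedocument."
                         "wordprocessingml.comments+xml")


def register_comments_part(content_types_xml):
    """Add an Override for /word/comments.xml unless one is already present."""
    if '/word/comments.xml' in content_types_xml:
        return content_types_xml
    override = '<Override PartName="/word/comments.xml" ContentType="{}"/>'.format(
        COMMENTS_CONTENT_TYPE)
    # Insert just before the closing tag of the root element.
    head, closing, tail = content_types_xml.rpartition('</Types>')
    if not closing:
        raise ValueError('no </Types> closing tag found')
    return head + override + closing + tail
```

The input is the decoded text of the [Content_Types].xml part at the root of the package.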

To learn more about the structure of docx documents, click here.
