Traversing a docx document in order

The core code for traversing a docx document in document order is as follows:

from docx import Document
from docx.document import Document as _Document
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Generate a reference to each paragraph and table child within *parent*,
    in document order. Each returned value is an instance of either Table or
    Paragraph. *parent* would most commonly be a reference to a main
    Document object, but also works for a _Cell object, which itself can
    contain paragraphs and tables.
    """
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("unsupported parent type: %s" % type(parent).__name__)
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


def extract_tables(document):
    """Print paragraph text and table rows of *document*, in document order."""
    # Iterate over the blocks (paragraphs and tables) in the document.
    for block in iter_block_items(document):
        if isinstance(block, Paragraph):
            print("text:  " + block.text)
        elif isinstance(block, Table):
            for row in block.rows:
                row_data = []
                for cell in row.cells:
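                    # Concatenate the stripped text of every paragraph in the cell.
                    text_cell = ''
                    for paragraph in cell.paragraphs:
                        text_cell += paragraph.text.strip()
                    # Compare by value: `is ''` tests identity and is unreliable.
                    if text_cell == '':
                        text_cell = "NULL"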
                    row_data.append(text_cell)
                print("|".join(row_data))


if __name__ == '__main__':
    document = Document('./xxxxx.docx')
    extract_tables(document)

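The core idea of the code above is to take each block out of the docx file in document order and check whether it is a Table or a Paragraph object. A paragraph is printed as text; a table is parsed further and its contents are printed row by row, with cells separated by "|" and empty cells shown as "NULL".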
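To make the "walk the body's children in order and branch on their type" idea concrete without python-docx, here is a minimal standard-library sketch. It assumes the usual WordprocessingML layout (paragraphs as `w:p`, tables as `w:tbl` under `w:body`) and runs on a small hand-written XML string rather than a real file; `SAMPLE_XML`, `element_text` and `walk_body` are hypothetical names introduced only for this illustration, and the cell handling is a simplification of the code above.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx file.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical, hand-written stand-in for the content of word/document.xml.
SAMPLE_XML = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>Intro</w:t></w:r></w:p>
    <w:tbl>
      <w:tr>
        <w:tc><w:p><w:r><w:t>a</w:t></w:r></w:p></w:tc>
        <w:tc><w:p/></w:tc>
      </w:tr>
    </w:tbl>
    <w:p><w:r><w:t>Outro</w:t></w:r></w:p>
  </w:body>
</w:document>
"""


def element_text(elm):
    """Join the text of all w:t descendants of *elm*."""
    return "".join(t.text or "" for t in elm.iter(f"{{{W}}}t"))


def walk_body(xml_string):
    """Yield ('text', str) or ('table', [row strings]) in document order."""
    body = ET.fromstring(xml_string.strip()).find("w:body", NS)
    for child in body:  # direct children only, so document order is kept
        if child.tag == f"{{{W}}}p":
            yield "text", element_text(child)
        elif child.tag == f"{{{W}}}tbl":
            rows = []
            for row in child.findall("w:tr", NS):
                # Empty cells become "NULL", mirroring the code above.
                cells = [element_text(tc).strip() or "NULL"
                         for tc in row.findall("w:tc", NS)]
                rows.append("|".join(cells))
            yield "table", rows


if __name__ == "__main__":
    for kind, value in walk_body(SAMPLE_XML):
        print(kind, value)
```

For the sample above, the generator yields the first paragraph, then the table, then the last paragraph, which is the same interleaved order that iter_block_items preserves. For real documents python-docx remains the more practical choice, since it wraps these elements in Paragraph and Table objects.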
