How to extract hyperlinks from a docx file with Python

Latest recommended article published on 2024-08-24 10:34:22

是晨星啊

Category: Python学习 (Python Learning). Tags: python, regular expressions

Article link: https://blog.csdn.net/s1162276945/article/details/102919305


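How to parse the content between <w:t> and </w:t> in Python
Use the paragraph's XML plus regular expressions.
If you only iterate with for paragraph in document.paragraphs to get the paragraphs outside tables, you still need to read each paragraph's .text attribute to get its text. In older python-docx releases, .text leaves out the text inside hyperlinks, which is why the code below reads the <w:t> elements from the raw XML instead.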
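
To see why matching <w:t> is enough, here is a minimal sketch that runs the regular expression on a hand-written fragment of paragraph XML. The fragment (SAMPLE_XML) and the helper name wt_text are illustrative assumptions, not taken from a real file. Text that belongs to a hyperlink sits in a run nested under <w:hyperlink>, but every piece of visible text is still wrapped in <w:t>, so one pattern catches both cases.

```python
import re
from html import unescape

# Hand-written sample of roughly what paragraph XML looks like
# (assumption: namespace declarations are omitted for brevity).
SAMPLE_XML = """
<w:p>
  <w:r><w:t xml:space="preserve">URL: </w:t></w:r>
  <w:hyperlink r:id="rId4">
    <w:r><w:rPr><w:rStyle w:val="Hyperlink"/></w:rPr><w:t>https://blog.csdn.net</w:t></w:r>
  </w:hyperlink>
  <w:r><w:tab/><w:t xml:space="preserve"> (a &amp; b)</w:t></w:r>
</w:p>
"""

# Match <w:t> or <w:t attr="...">, but not <w:tab/>, <w:tbl>, <w:tc>, ...
WT_PATTERN = re.compile(r'<w:t(?:\s[^>]*)?>(.*?)</w:t>', re.S)


def wt_text(xml_str):
    """Concatenate the contents of all <w:t> elements, decoding XML entities."""
    return ''.join(unescape(chunk) for chunk in WT_PATTERN.findall(xml_str))


print(wt_text(SAMPLE_XML))
```

The pattern requires either ">" or whitespace right after "<w:t", so a tag such as <w:tab/> is not mistaken for a text element, and unescape turns entities such as &amp; back into plain characters.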

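import re
from html import unescape

from docx import Document

# Match <w:t> or <w:t attr="...">, but not <w:tab/>, <w:tbl>, <w:tc>, ...
WT_PATTERN = re.compile(r'<w:t(?:\s[^>]*)?>(.*?)</w:t>', re.S)


def get_paragraph_from_docx(file_name):
    """
    Return the text of every non-empty body paragraph, including hyperlink text.

    Example input:
        "URL: https:blog.csdn.net, this is a paragraph with a hyperlink"
        "This is a paragraph without a hyperlink"
    Handles text that contains hyperlinks, but skips tables.
    :param file_name: path to the .docx file
    :return: list of paragraph strings with all whitespace removed
    """
    text = []
    document = Document(file_name)
    for paragraph in document.paragraphs:
        # Works whether or not the paragraph contains a hyperlink;
        # _p is the underlying <w:p> element
        xml_str = paragraph._p.xml
        # Join the contents of all <w:t> elements and decode entities such as &amp;
        t_para = unescape(''.join(WT_PATTERN.findall(xml_str)))
        # Remove all whitespace, including spaces between words
        t_para = re.sub(r'\s', '', t_para)
        if t_para:
            text.append(t_para)
    return text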

d = Document('./test.docx')
for p in d.paragraphs:
    xml_str = p._p.xml
    # Match <w:t> or <w:t attr="..."> only, not <w:tab/> and similar tags
    wt_list = re.findall(r'<w:t(?:\s[^>]*)?>(.*?)</w:t>', xml_str, re.S)
    # Paragraph text, including the display text of any hyperlink
    para_text = ''.join(wt_list)
    print(para_text)
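
Note that the snippets above recover the display text of a hyperlink, not the URL it points to. Inside the docx package, the target of an external link is stored in the relationships part word/_rels/document.xml.rels, and each <w:hyperlink> element refers to it through its r:id attribute. The following is a minimal sketch using only the standard library. The function name extract_hyperlinks is hypothetical, and it assumes the main part lives at word/document.xml and uses the usual (transitional) namespaces. Internal bookmark links and links in headers, footers, or footnotes are not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
R_NS = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships'
PKG_REL_NS = 'http://schemas.openxmlformats.org/package/2006/relationships'
HYPERLINK_TYPE = R_NS + '/hyperlink'


def extract_hyperlinks(docx_path):
    """Return a list of (display_text, url) pairs from the main document body."""
    with zipfile.ZipFile(docx_path) as zf:
        rels_root = ET.fromstring(zf.read('word/_rels/document.xml.rels'))
        doc_root = ET.fromstring(zf.read('word/document.xml'))

    # Map relationship id -> target URL, keeping hyperlink relationships only.
    targets = {
        rel.get('Id'): rel.get('Target')
        for rel in rels_root.iter('{%s}Relationship' % PKG_REL_NS)
        if rel.get('Type') == HYPERLINK_TYPE
    }

    links = []
    # iter() walks the whole tree, so hyperlinks inside tables are included.
    for link in doc_root.iter('{%s}hyperlink' % W_NS):
        r_id = link.get('{%s}id' % R_NS)
        if r_id not in targets:
            continue  # e.g. an internal bookmark link that has no r:id
        text = ''.join(t.text or '' for t in link.iter('{%s}t' % W_NS))
        links.append((text, targets[r_id]))
    return links
```

Calling extract_hyperlinks('./test.docx') and looping over the result with for text, url in ... prints each link's visible text next to its target. Because this walks the whole document tree rather than document.paragraphs, links inside table cells are picked up as well.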