html text to word,python - HTML to Word docx - Stack Overflow

I just wrote a bit of code to convert the text from a tkinter Text widget over to a word document, including any bold tags that the user can add. This isn't a complete solution for you, but it may help you to start toward a working solution. I think you're going to have to do some regex work to get the hyperlinks transferred to the word document. Stacked formatting tags may also get tricky. I hope this helps:

from docx import Document

html = 'HTML string here.'

html = html.split('

html = [html[0]] + ['

doc = Document()

p = doc.add_paragraph()

for run in html:

if run.startswith(''):

run = run.lstrip('')

runner = p.add_run(run)

runner.bold = True

elif run.startswith(''):

run = run.lstrip('')

runner = p.add_run(run)

else:

p.add_run(run)

doc.save('test.docx')

I came back to it and made it possible to parse out multiple formatting tags. This will keep a tally of what formatting tags are in play in a list. At each tag, a new run is created, and formatting for the run is set by the current tags in play.

from docx import Document

import re

import docx

from docx.shared import Pt

from docx.enum.dml import MSO_THEME_COLOR_INDEX

def add_hyperlink(paragraph, text, url):

# This gets access to the document.xml.rels file and gets a new relation id value

part = paragraph.part

r_id = part.relate_to(url, docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK, is_external=True)

# Create the w:hyperlink tag and add needed values

hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')

hyperlink.set(docx.oxml.shared.qn('r:id'), r_id, )

# Create a w:r element and a new w:rPr element

new_run = docx.oxml.shared.OxmlElement('w:r')

rPr = docx.oxml.shared.OxmlElement('w:rPr')

# Join all the xml elements together add add the required text to the w:r element

new_run.append(rPr)

new_run.text = text

hyperlink.append(new_run)

# Create a new Run object and add the hyperlink into it

r = paragraph.add_run ()

r._r.append (hyperlink)

# A workaround for the lack of a hyperlink style (doesn't go purple after using the link)

# Delete this if using a template that has the hyperlink style in it

r.font.color.theme_color = MSO_THEME_COLOR_INDEX.HYPERLINK

r.font.underline = True

return hyperlink

html = '

I want to

convert HTML to docx in bold and bold italic.'

html = html.split('

html = [html[0]] + ['

tags = []

doc = Document()

p = doc.add_paragraph()

for run in html:

tag_change = re.match('(?:)', run)

if tag_change != None:

tag_strip = tag_change.group(0)

tag_change = tag_change.group(1)

if tag_change.startswith('/'):

if tag_change.startswith('/a'):

tag_change = next(tag for tag in tags if tag.startswith('a '))

tag_change = tag_change.strip('/')

tags.remove(tag_change)

else:

tags.append(tag_change)

else:

tag_strip = ''

hyperlink = [tag for tag in tags if tag.startswith('a ')]

if run.startswith('

run = run.replace(tag_strip, '')

if hyperlink:

hyperlink = hyperlink[0]

hyperlink = re.match('.*?(?:href=")(.*?)(?:").*?', hyperlink).group(1)

add_hyperlink(p, run, hyperlink)

else:

runner = p.add_run(run)

if 'b' in tags:

runner.bold = True

if 'u' in tags:

runner.underline = True

if 'i' in tags:

runner.italic = True

if 'H1' in tags:

runner.font.size = Pt(24)

else:

p.add_run(run)

doc.save('test.docx')

Hyperlink function thanks to this question. My concern here is that you will need to manually code for every HTML tag that you want to carry over to the docx. I imagine that could be a large number. I've given some examples of tags you may want to account for.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值