Highlighting words in a docx file with python-docx produces wrong results

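This article describes problems encountered when trying to highlight specific words in a docx file with the python-docx library in Python, including duplicated sentences, lost formatting, and lost trailing characters. It also discusses possible approaches, such as using character offsets to locate the runs that contain the target string, and how to handle inconsistent formatting. The advice is to solve the problems one at a time, starting with the lost trailing characters, and to print the text of each run to understand how the text is split.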

I would like to highlight specific words in an MS Word document (given here as negativList) and leave the rest of the document as it was before. I have tried to adapt this one, but I cannot get it to run as it should:

from docx.enum.text import WD_COLOR_INDEX
from docx import Document
import pandas as pd
import copy
import re

doc = Document(docxFileName)
negativList = ["king", "children", "lived", "fire"] # some examples

for paragraph in doc.paragraphs:
    for target in negativList:
        if target in paragraph.text: # it is worth checking in detail ...
            currRuns = copy.copy(paragraph.runs) # deep copy as we delete/clear the object
            paragraph.runs.clear()
            for run in currRuns:
                if target in run.text:
                    words = re.split('(\W)', run.text) # split into words in order to be able to color only one
                    for word in words:
                        if word == target:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = WD_COLOR_INDEX.PINK
                        else:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = None
                else: # our target is not in it so we add it unchanged
                    paragraph.runs.append(run)

doc.save('output.docx')

As an example, I am using this text (in a Word docx file):

CHAPTER 1

Centuries ago there lived --

"A king!" my little readers will say immediately.

No, children, you are mistaken. Once upon a time there was a piece of
wood. It was not an expensive piece of wood. Far from it. Just a
common block of firewood, one of those thick, solid logs that are put
on the fire in winter to make cold rooms cozy and warm.

There are multiple problems with my code:

1) The first sentence works, but the second sentence appears twice. Why?

2) The formatting somehow gets lost in the part where I highlight. I would probably need to copy the properties of the original run into the newly created ones, but how do I do this?

3) I lose the terminal "--".

4) In the highlighted last paragraph, "cozy and warm" is missing ...

What I need is either a fix for these problems, or maybe I am overthinking it and there is a much easier way to do the highlighting (something like doc.highlight({"king": "pink"}), but I haven't found anything in the documentation)?

Solution

You're not overthinking it, this is a challenging problem; it is a form of the search-and-replace problem.

The target text can be located fairly easily by searching Paragraph.text, but replacing it (or in your case adding formatting) while retaining other formatting requires access at the Run level, both of which you've discovered.

There are some complications though, which is what makes it challenging:

There is no guarantee that your "find" target string is located entirely in a single run. So you will need to find the run containing the start of your target string and the run containing the end of your target string, as well as any in-between.

This might be aided by using character offsets. For example, "king" appears at character offset 3 in '"A king!" ...' and has a length of 4, so the match covers offsets 3 up to (but not including) 3+4. You would then identify which run contains character 3 and which run contains the last character of the match.

Related to the first complication, there is no guarantee that all the runs in which the target string partly appears are formatted the same. For example, if your target string was "a bold word" with only "bold" in bold type, the updated version (after adding highlighting) would require at least three runs: one for "a ", one for "bold", and one for " word" (btw, which run each of the two space characters appears in won't change how they look).
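The character-offset idea can be sketched in plain Python. This is only an illustration, not part of python-docx: the helper names find_word_spans, run_boundaries and runs_overlapping are made up here, and the sketch assumes the paragraph text is exactly the concatenation of its run texts (which may not hold if the paragraph contains hyperlinks or other non-run content). With python-docx you would build the input as [r.text for r in paragraph.runs].

```python
import re


def find_word_spans(text, word):
    """Return (start, end) offsets of whole-word matches of `word` in `text`.

    `end` is exclusive, so the match is text[start:end].
    """
    pattern = r"\b" + re.escape(word) + r"\b"
    return [m.span() for m in re.finditer(pattern, text)]


def run_boundaries(run_texts):
    """Return the (start, end) offsets each run occupies in the joined text."""
    bounds, pos = [], 0
    for t in run_texts:
        bounds.append((pos, pos + len(t)))
        pos += len(t)
    return bounds


def runs_overlapping(run_texts, start, end):
    """Return indices of runs holding at least one character of [start, end)."""
    return [
        i
        for i, (s, e) in enumerate(run_boundaries(run_texts))
        if max(s, start) < min(e, end)  # non-empty overlap
    ]


# Hypothetical split of the second paragraph into two runs,
# with "king" straddling the run boundary.
run_texts = ['"A ki', 'ng!" my little readers will say immediately.']
text = "".join(run_texts)
for start, end in find_word_spans(text, "king"):
    print(start, end, runs_overlapping(run_texts, start, end))
```

Matching on word boundaries rather than with a plain substring test also keeps "fire" from matching inside "firewood".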

If you accept the simplification that the target string will always be a single word, you can consider the simplification of giving the replacement run the formatting of the first character (first run) of the found target runs, which is probably the usual approach.

So I suppose there are a few possible approaches, but one would be to "normalize" the runs of each paragraph containing the target string, such that the target string appeared within a distinct run. Then you could just apply highlighting to that run and you'd get the result you wanted.
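A rough sketch of that normalization step, again in plain Python rather than against real python-docx objects. Runs are modelled here as hypothetical (text, fmt) tuples, where fmt stands in for whatever formatting you would copy from the original run, and isolate_spans is a made-up helper name. It cuts runs at every span boundary, so each piece lies either entirely inside or entirely outside a target span.

```python
def isolate_spans(runs, spans):
    """Split (text, fmt) runs so every span [start, end) falls on run edges.

    Returns (text, fmt, is_target) triples. Each piece keeps the fmt of the
    run it came from; is_target marks pieces lying inside a target span.
    """
    # Every span start and end is a point where a run may need cutting.
    cuts = sorted({offset for span in spans for offset in span})
    out, pos = [], 0
    for text, fmt in runs:
        run_start, run_end = pos, pos + len(text)
        # The run's own edges plus any cut points strictly inside it.
        inner = [c for c in cuts if run_start < c < run_end]
        edges = [run_start] + inner + [run_end]
        for a, b in zip(edges, edges[1:]):
            if a == b:
                continue  # drop empty pieces (empty input runs)
            is_target = any(s <= a and b <= e for s, e in spans)
            out.append((text[a - run_start:b - run_start], fmt, is_target))
        pos = run_end
    return out


# "king" occupies offsets 3..7 and straddles two differently formatted runs.
runs = [('"A ki', {"bold": True}), ('ng!" my', {"bold": False})]
for piece in isolate_spans(runs, [(3, 7)]):
    print(piece)
```

In this sketch a target that crosses runs with different formatting stays as several flagged pieces, one per original run, so no formatting is lost. Under the single-word simplification above you could instead merge the flagged pieces and give them the formatting of the first one. In a real document, each flagged piece would become a run whose font.highlight_color is set, as in the question's code.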

To be of more help, you'll need to narrow down the problem areas and provide specific inputs and outputs. I'd start with the first one (perhaps losing the "--") (in a separate question, perhaps linked from here) and then proceed one by one until it all works. It's asking too much for a respondent to produce their own test case :)

Then you'd have a question like: "I run the string: 'Centuries ago ... --' through this code and the trailing "--" disappears ...", which is a lot easier for folks to reason through.

Another good next step might be to print out the text of each run, just so you get a sense of how they're broken up. That may give you insight into where it's not working.
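Something like the following would do for that. describe_runs is a made-up helper name; it relies only on the paragraph.runs and run.text attributes that the question's code already uses, and the file name in the usage comment is a placeholder.

```python
def describe_runs(paragraph):
    """Return one line per run, showing its index and its exact text."""
    # repr() makes leading/trailing spaces and empty runs visible.
    return [f"{i}: {run.text!r}" for i, run in enumerate(paragraph.runs)]


# Usage with python-docx (the file name is a placeholder):
#
#   from docx import Document
#   for paragraph in Document("input.docx").paragraphs:
#       print("\n".join(describe_runs(paragraph)))
#       print("---")
```

Word often splits what looks like a single sentence into several runs, so the output may show target words or the trailing "--" sitting in runs you did not expect.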
