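I iterate over the document paragraph by paragraph, then split each paragraph's text into sentences on '. ' (a period followed by a space). I split the paragraph text into sentences so that the search is more efficient than searching the whole paragraph text.
The code then searches every word of each sentence for errors, which are taken from an error-correction database. Here is a simplified version of the code:
from docx.enum.text import WD_BREAK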
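# `document` is an already opened docx Document; `error_dictionary` is
# assumed to map each error string to its correction.
current_page_number = 1
for paragraph in document.paragraphs:
    sentences = paragraph.text.split('. ')
    for sentence in sentences:
        words = sentence.split(' ')
        for word in words:
            for error, correction in error_dictionary.items():
                if error in word:
                    # (A) make simple replacement (changes only the local string)
                    word = word.replace(error, correction, 1)
                    # (B) alternative replacement based on runs
                    for run in paragraph.runs:
                        if error in run.text:
                            run.text = run.text.replace(error, correction, 1)
                        # here we may fetch page break attribute and knowing current number
                        # find out at what page the replacement has taken place
                        # (pseudocode: `run.page_break` is the attribute I am looking for,
                        # not something python-docx is known to provide)
                        if run.page_break == WD_BREAK.PAGE:
                            current_page_number += 1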
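
One possible way to approximate the page-tracking idea is to inspect each run's underlying XML instead of relying on a run attribute. The sketch below is only an illustration under stated assumptions: the helper names `count_page_breaks` and `correct_runs` are hypothetical, runs are passed in as `(text, xml)` pairs, and a page break found inside a run is treated as coming before that run's text. It counts explicit breaks (`<w:br w:type="page"/>`) and the soft `<w:lastRenderedPageBreak/>` markers that Word may store when it saves a file.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in .docx files
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_page_breaks(run_xml):
    """Count page breaks recorded in one serialized <w:r> element."""
    root = ET.fromstring(run_xml)
    # Explicit page breaks: <w:br w:type="page"/>
    hard = sum(
        1
        for br in root.iter(f"{{{W_NS}}}br")
        if br.get(f"{{{W_NS}}}type") == "page"
    )
    # Soft breaks stored by Word at save time: <w:lastRenderedPageBreak/>
    soft = sum(1 for _ in root.iter(f"{{{W_NS}}}lastRenderedPageBreak"))
    return hard + soft


def correct_runs(runs, corrections):
    """Apply corrections to (text, xml) run pairs while tracking the page.

    Returns the corrected texts and a list of (error, page_number) hits.
    """
    page = 1
    texts, hits = [], []
    for text, xml in runs:
        # Simplification: breaks in a run are assumed to precede its text.
        page += count_page_breaks(xml)
        for error, correction in corrections.items():
            if error in text:
                text = text.replace(error, correction, 1)
                hits.append((error, page))
        texts.append(text)
    return texts, hits
```

With python-docx, the pairs could presumably be built as `[(run.text, run._element.xml) for run in paragraph.runs]`, assuming the installed version exposes the serialized XML that way. The resulting page numbers are only an estimate: `lastRenderedPageBreak` is present only if Word wrote it during the last save, and real pagination depends on the rendering application.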