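First: take a Microsoft Word 97-2003 document (.doc) whose content looks roughly like the following.
A .doc file is a binary format rather than plain text, so opening it as text fails with either "gbk" or "utf-8" encoding: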
file=r"C:\Users\Wu\Desktop\待处理文档.doc"
work_book=open(file,encoding="gbk")
write_in=open(r"C:\Users\Wu\Desktop\待写入文档.txt","w")
Error ("utf-8" and "utf-8-sig" produce similar errors):
UnicodeDecodeError: 'gbk' codec can't decode byte 0xb1 in position 5: illegal multibyte sequence
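The position and byte in this error match the fixed 8-byte signature that legacy .doc files (OLE2 compound files) start with: D0 CF 11 E0 A1 B1 1A E1. The sketch below is only an illustration of that point, not part of the original workflow. The helper names `looks_like_legacy_doc` and `first_decode_error` are made up for this example.

```python
# Signature shared by OLE2 compound files such as legacy .doc documents.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_legacy_doc(header: bytes) -> bool:
    """Return True if the given leading bytes carry the OLE2 signature."""
    return header.startswith(OLE2_MAGIC)


def first_decode_error(data: bytes, encoding: str):
    """Return (position, byte) of the first decode failure, or None if decoding works."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


# Decoding just the signature as text already fails, whichever codec is used.
print(looks_like_legacy_doc(OLE2_MAGIC))
print(first_decode_error(OLE2_MAGIC, "gbk"))
print(first_decode_error(OLE2_MAGIC, "utf-8"))

# To check a real file, read its first 8 bytes in binary mode, e.g.:
#     with open(path, "rb") as f:
#         looks_like_legacy_doc(f.read(8))
```

If the check returns True for a file, no choice of text encoding will help; the content has to be extracted from the Word format first.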
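Solution:
1. Copy the contents of the Word document into a plain-text file (.txt) saved as UTF-8, then read that file with "utf-8" encoding.
Sample code (using a regular expression to process the document):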
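import re

file = r"C:\Users\Wu\Desktop\待处理文档.txt"
# Matches any line that contains "答案" ("answer").
pattern_answer = re.compile("答案")

# Copy every line that does not contain the pattern to the output file.
# The output is written as UTF-8 so it does not depend on the system default encoding.
with open(file, encoding="utf-8") as work_book, \
        open(r"C:\Users\Wu\Desktop\待写入文档.txt", "w", encoding="utf-8") as write_in:
    for line in work_book:
        if pattern_answer.search(line) is None:
            write_in.write(line)
print("OK")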
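The filtering step above can also be tried without touching any files. The following is a minimal sketch that assumes the goal is simply to drop every line containing a given pattern. The function name `drop_matching_lines` and the sample text are invented for illustration.

```python
import io
import re


def drop_matching_lines(lines, pattern):
    """Yield only the lines in which the regular expression finds no match."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line) is None:
            yield line


# Made-up sample standing in for the text copied out of the Word document.
sample = io.StringIO("1. 题目一\n答案:A\n2. 题目二\n答案:B\n")
kept = list(drop_matching_lines(sample, "答案"))
print(kept)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the in-memory sample.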