删除线分为两种, 一是单元格有strike属性, 另一种是单元格无strike属性, 单元格内的部分文本带有strike属性。
单元格的strike判断可以使用如下示例方法:
import openpyxl
book = openpyxl.load_workbook("test.xlsx")
sheet = book.worksheets[0]
for row in sheet.rows:
for cell in row:
if cell.font.strike:
pass
当单元格无strike属性时, 由于openpyxl不支持单元格更底层的元素解析,就只能自己写excel文档的解析方法,来获取带有删除线的内容。
xlsx后缀的Excel文档是由一组xml文档压缩而成,将xlsx文档后缀改为zip,可以解压出一组xml文件。在该目录下的./xl/sharedStrings.xml中,可以通过搜索strike属性来查看带有删除线的内容格式。为了自动处理,使用正则表达式对该xml文件进行匹配,并将匹配到的删除线内容替换为空,即可实现该功能。示例代码如下:
import os
import re
import openpyxl
import zipfile
def excel_strike_content_delete(path, final_file):
extract_path = os.path.join(os.path.split(path)[0], "extract_path")
if os.path.exists(extract_path):
shutil.rmtree(extract_path)
os.mkdir(extract_path)
else:
os.mkdir(extract_path)
f = zipfile.ZipFile(path)
for file in f.namelist():
f.extract(file, extract_path)
f.close()
xml = extract_path + '/xl/sharedStrings.xml'
with open(xml, 'rt', encoding="utf-8") as f:
content = f.readlines()
content = "".join([line for line in content])
def replace(matched):
group0 = matched.group(0)
group1 = matched.group(1)
group2 = matched.group(2)
# print(group0)
res = group0.replace(group2, "<t></t>").replace(group1, "")
# print("res: ", res)
return res
pattern = r"<rPr>(<strike/>).*?</rPr>(<t>(?:.*?)</t>)"
# print(type(content))
line = re.sub("</si><si>", "</si>myLabel<si>", content, flags=re.S)
# print(type(line))
ls = line.split("myLabel")
for i, line in enumerate(ls):
# print(" ", line)
line = re.sub(pattern, replace, line, flags=re.S)
ls[i] = line
# print(line)
with open(xml, 'w+', encoding='utf-8') as f:
f.write("".join(ls))
f = zipfile.ZipFile(final_file, mode='w')
for path, dirnames, filenames in os.walk(extract_path):
# 去掉目标跟路径,只对目标文件夹下边的文件及文件夹进行压缩
fpath = path.replace(extract_path, '')
for filename in filenames:
f.write(os.path.join(path, filename), os.path.join(fpath, filename))
f.close()
return os.path.abspath(final_file)
这样就将excel中的删除线部分删除掉了。