单元格如此格式的内嵌图片=DISPIMG("ID_14C8AB278D76409D8D12A9530852E2AD",1)
开发过程:
1、了解xlsx文件结构:原来就是zip
2、解压xlsx,了解文档的关联:
2.1./xxx/xl/media保存了图片
2.2./xxx/xl/_rels/cellimages.xml.rels保存了图片文件名和rId的关系
2.3./xxx/xl/cellimages.xml保存了rId和ID的关系如此格式("ID_14C8AB278D76409D8D12A9530852E2AD")
3、制作了ReadWpsXlsxImg()模块,以后方便使用:
3.1 使用时先实例化 ReadWpsXlsxImg() 对象,然后给其属性inputpath 赋值(待解析xlsx文件路径)
3.2 调用 execute() 方法,
3.3 结果保存在nameimgdatadic属性dic中,其key为类似“ID_14C8AB278D76409D8D12A9530852E2AD”的ID,其值为图片二进制数据;
4、问题:
这模块开发过程中获取cellimages.xml中rId的值,弄了好久,现在都没明白,不过ai帮忙搞定了,就是这句成功了: embed= blip.find('.//a:blip', namespaces={'a': 'http://schemas.openxmlformats.org/drawingml/2006/main', 'r': 'http://schemas.openxmlformats.org/officeDocument/2006/relationships'}).attrib.get('{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed')
5、不喜勿喷,新手练习
'''
功能:找到excel单元格如此格式的内嵌图片=DISPIMG("ID_14C8AB278D76409D8D12A9530852E2AD",1)
1. 实现过程:
1.1 在输入文件名路径解压 xxx.xlsx 文件,生成路径解压文件夹./xxx
1.2 在./xxx/xl/cellimages.xml中找到图片的 rId,以及name,,形成字典name:rId
1.3 在./xxx/xl/_rels/cellimages.xml.rels 通过rId找到图片的绝对路径;形成name:绝对路径的字典
1.4在./xxx/xl/media 读取图片的二进制数据:形成name:二进制数据的字典
1.5处理完成后删除文件夹
2. 使用方法:
2.1 使用时先实例化 ReadWpsXlsxImg() 对象,然后调用给其属性inputpath 赋值待解析xlsx文件路径
2.2 调用 execute() 方法,
2.3 结果保存在nameimgdatadic属性dic中,其key为类似“ID_14C8AB278D76409D8D12A9530852E2AD”的ID,其值为图片二进制数据
'''
from lxml import etree
import zipfile
import os
import shutil
class ReadWpsXlsxImg():
def __init__(self):
self.inputpath = ''
self.extractpath = ''
self.namerIddic = {} # name:rId
self.namepath = {} # name:imgpath
self.nameimgdatadic = {} # name:imgdata
def execute(self):
if self.inputpath != '':
try:
self.unzip_xlsx()
# print( f'成功解压到{self.extractpath}文件夹')
except Exception as e:
# print(f'解压{self.inputpath}时出错: {str(e)}')
return
# 读取 cellimages.xml 文件,找到name:rId
local_path = 'xl/cellimages.xml'
file_path = os.path.join(self.extractpath, local_path)
# print(file_path)
try:
self.find_name_rId(file_path)
# print( '成功写入name:rId字典')
except Exception as e:
# print(f'写入name:rId字典时出错: {str(e)}')
return
# 读取 cellimages.xml.rels 文件,找到rId:相对路径的字典
local_path = 'xl/_rels/cellimages.xml.rels'
file_path = os.path.join(self.extractpath, local_path)
# print(file_path)
try:
self.find_name_path(file_path)
# print( '成功写入name:imgpath字典')
except Exception as e:
# print(f'写入name:imgpath时出错: {str(e)}')
return
# 读取imgpath的文件,生成二进制数据data,生成name:imgdata字典
try:
self.find_name_imgdata()
# print( '成功写入name:imgdata字典')
except Exception as e:
# print(f'写入name:imgdata时出错: {str(e)}')
return
try:
self.delete_folder()
except Exception as e:
print(f'删除文件夹时出错: {str(e)}')
return
def unzip_xlsx(self):
# 获取不带扩展名的文件名,作为解压目标文件夹名称
xlsx_file_name = os.path.basename(self.inputpath)[:-5] # 去掉 .xlsx 扩展名
# 确定解压目标文件夹路径(与 xlsx 文件同一目录)
extract_dir = os.path.join(os.path.dirname(self.inputpath),xlsx_file_name)
# print(extract_dir)
# 如果目标文件夹不存在,则创建它
if not os.path.exists(extract_dir):
os.makedirs(extract_dir)
# 使用 zipfile 解压 xlsx 文件到目标文件夹
with zipfile.ZipFile(self.inputpath, 'r') as xlsx_zip:
xlsx_zip.extractall(extract_dir)
self.extractpath = extract_dir
def find_name_rId(self,file_path):
tree = etree.parse(file_path)
root = tree.getroot()
# 定义命名空间
ns = {
'etc': 'http://www.wps.cn/officeDocument/2017/etCustomData',
'xdr': 'http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing',
'a': 'http://schemas.openxmlformats.org/drawingml/2006/main',
'r': 'http://schemas.openxmlformats.org/officeDocument/2006/relationships'
}
# 遍历每个 <etc:cellImage> 元素
for cell_image in root.findall('.//etc:cellImage', namespaces=ns):
# 提取 <xdr:pic> 元素
pic = cell_image.find('.//xdr:pic', namespaces = ns)
if pic is None:
continue # 如果找不到 <xdr:pic> 元素,跳过
# 提取 name 属性
name_element = pic.find('.//xdr:cNvPr', namespaces=ns)
if name_element is None or 'name' not in name_element.attrib:
continue # 如果找不到 <xdr:cNvPr> 元素或没有 'name' 属性,跳过
name = name_element.attrib['name']
# 提取 r:embed 属性
blip = pic.find('.//xdr:blipFill', namespaces = ns)
embed= blip.find('.//a:blip', namespaces={'a': 'http://schemas.openxmlformats.org/drawingml/2006/main', 'r': 'http://schemas.openxmlformats.org/officeDocument/2006/relationships'}).attrib.get('{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed')
self.namerIddic[name] = embed
# 输出结果
# print(self.namerIddic)
def find_name_path(self,xml_file_path):
tree = etree.parse(xml_file_path)
root = tree.getroot()
for relationship in root:
id_value = relationship.attrib['Id']
target_value = relationship.attrib['Target']
# print(f"ID: {id_value}, Target: {target_value}")
for name in self.namerIddic.keys():
if self.namerIddic[name] == id_value:
self.namepath[name] = target_value
# print(f"namepath:{self.namepath}")
def find_name_imgdata(self):
for name, target_path in self.namepath.items():
basepath = os.path.join(self.extractpath,'xl')
imgpath = os.path.join(basepath, target_path)
with open(imgpath, 'rb') as file:
imgdata = file.read()
self.nameimgdatadic[name] = imgdata
print(f'找到二进制文件字典{len(self.nameimgdatadic)}')
def delete_folder(self):
shutil.rmtree(self.extractpath)
if __name__ == '__main__':
reader = ReadWpsXlsxImg()
reader.inputpath =r'C:/Users/hxs19/Desktop/input.xlsx'
reader.execute()