用xlrd读xls类型的文件,结果一直报错
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<?xml ve'
最后发现原来该文件是披着xls格式的xml文件,查找了很多资料,发现大多数都是针对纯xml文件的解决方法,对于偏向于excel结构的很少。由于我的xml文件一打开就只有一行,而且很卡,所以光检查文件里的标签就花了很长时间。下面附上一般偏向excel的xml文件。
<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
<DocumentProperties xmlns="urn:schemas-microsoft-com:office:office">
<Title>Excel表格</Title>
<LastAuthor>bigtall</LastAuthor>
</DocumentProperties>
<Styles>
<Style ss:ID="Default" ss:Name="Normal">
<Alignment ss:Vertical="Center"/>
<Font ss:FontName="宋体" x:CharSet="134" ss:Size="12"/>
</Style>
</Styles>
<Worksheet ss:Name="tt">
<Table>
<Row>
<Cell ss:MergeAcross="6" ><Data ss:Type="String">Hello!World!</Data></Cell>
</Row>
</Table>
</Worksheet>
</Workbook>
还可以精简到:
<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
<Worksheet ss:Name="tt">
<Table>
<Row>
<Cell><Data ss:Type="String">Hello!World!</Data></Cell>
</Row>
</Table>
</Worksheet>
</Workbook>
所以对于规整的标签格式,用正则就可以取出对应的元素,然后存入列表,这就是解决这个问题的思路。
#!/usr/bin/python3
# -*- coding: utf-8 -*-
# @Time : 2018/6/13 0013 8:42
# @Author : 一梦南柯
# @File : upload_data4.py
'''
用python读取xml格式的excel文件
'''
# coding=utf-8
import os
import re
import csv
INPUTPATH = u"F:\\GY-DB\\upload_data"
def readxmldata():
strstock = os.path.join(INPUTPATH,'ttt.xml')
fp = open(strstock, "rb")
file_data = fp.read()
data1 = re.findall('<Data ss:Type=("String"|"Number")>([\s\W\w\S]*?)</Data></Cell>', file_data.decode('utf-8'))[1:-2]
print(len(data1))
data2 = []
for i in range(len(data1)):
data2.append(data1[i][1])
len_axis = 28
re_data = []
for ii in range(0, int(len(data2) / len_axis)):
re_data.append(data2[ii * len_axis:(ii + 1) * len_axis])
print(int(len(data1) / len_axis) - 1)
print(re_data[int(len(data1) / len_axis) - 1], len(re_data))
with open('F:\\GY-DB\\upload_data\\{t3}.csv'.format(t3='13ruku1'), 'w',
newline='') as f:
writer = csv.writer(f)
writer.writerows(re_data)
f.close()
if __name__ == '__main__':
readxmldata()
因为xml文件中的标签内有<Data ss:Type="String">,<Data ss:Type="Number">这两种格式,所以用正则匹配或运算同时获取满足这两个条件之一的元素。
如果用这种方式打开的话
dom = xml.dom.minidom.parse('abc.xml')
非常吃内存,一下子吃了6个g吧。
Reference:
https://www.cnblogs.com/bigtall/archive/2004/10/13/51821.aspx
https://blog.csdn.net/lishenluo/article/details/53560961