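I previously wrote an XML-parsing script with Python's minidom. It worked well early on, when the XML files were small. Once a file grew past 70 MB, though, minidom was not only slow but also used a huge amount of memory, because it reads the whole document in and builds the complete XML tree (code written this way is easy to follow, but it really is slow and memory-hungry). A 70 MB XML file ate more than 4 GB of my 8 GB of RAM, which is frightening. Since the XML files I have to read will probably keep growing, I quickly wrote a new reader script.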
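To make the memory difference concrete, here is a rough sketch that compares the peak memory traced by `tracemalloc` while parsing the same generated document with minidom and with `iterparse`. The helper names (`make_sample`, `peak_bytes`, and so on) and the generated document are made up for illustration, and `tracemalloc` only sees allocations made through Python's allocator, so treat the numbers as a relative indication rather than a measurement of the 70 MB case.

```python
import io
import tracemalloc
import xml.etree.ElementTree as ET
from xml.dom import minidom


def make_sample(n_results):
    """Build a hypothetical <results> document with n_results entries."""
    rows = "".join(
        '<result id_product="1" id_evaluation="%d">%d</result>' % (i, i)
        for i in range(n_results)
    )
    return ("<results>" + rows + "</results>").encode("utf-8")


def peak_bytes(func, data):
    """Return the peak number of bytes traced while func(data) runs."""
    tracemalloc.start()
    try:
        func(data)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def parse_with_minidom(data):
    # Builds the complete DOM tree before anything can be read from it.
    dom = minidom.parseString(data)
    return len(dom.getElementsByTagName("result"))


def parse_with_iterparse(data):
    # Visits each element once and discards its contents straight away.
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "result":
            count += 1
        elem.clear()
    return count


sample = make_sample(5000)
dom_peak = peak_bytes(parse_with_minidom, sample)
iter_peak = peak_bytes(parse_with_iterparse, sample)
print("minidom peak (bytes):  ", dom_peak)
print("iterparse peak (bytes):", iter_peak)
```

The gap should widen as the document grows, because minidom keeps every node alive while the streaming loop only keeps emptied elements.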
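Before starting, I read this article and this other article, and decided to go with the ET_iter approach (ElementTree's iterparse) described there.
Then I found this blogger's post and wrote my code modelled on it: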
# coding=utf-8
__author__ = 'Arthur'
import mysql.connector  # not used in this snippet
import sys  # not used in this snippet
# cElementTree was removed in Python 3.9; ElementTree picks up the C accelerator automatically
import xml.etree.ElementTree as ET

if __name__ == "__main__":
    for event, elem in ET.iterparse("test2.xml", events=('start', 'end')):
        if event == 'start':
            if elem.tag == 'product' or elem.tag == 'property' or elem.tag == 'evaluation':
                print(elem.attrib)
            elif elem.tag == 'result':
                a_result = elem.attrib
                a_result['value'] = elem.text
                if elem.text is None:
                    print("result none")
                else:
                    print(a_result)
        elif event == 'end':
            if elem.tag == 'products':
                print("deal with products over")
            elif elem.tag == 'propertys':
                print("deal with propertys over")
            elif elem.tag == 'evaluations':
                print("deal with evaluations over")
            elif elem.tag == 'results':
                print("deal with results over")
        # runs after every event, 'start' as well as 'end'
        elem.clear()
At first I tested with an XML file I had put together myself, and everything worked:
<?xml version='1.0' encoding='utf-8'?>
<testresults source="ICRT EvalDB" type="data"
user="unknown">
<project id_project="697"
icrt_code="IC16539"
name="Combined Wearables"
comment="">
<snapshots>
<snapshot id_snapshot="4"
name="Combined snapshot"
timestamp_created="1471515160"
timestamp_lastchange="1482147798"
time_lastchange="2016-12-19 (11:43)">
<manufacturers>
<manufacturer id_manufacturer="1"
name="Apple"
comment=""
timestamp_created="1465471929"
timestamp_lastchange="0" />
<manufacturer id_manufacturer="2"
name="Fitbit"
comment=""
timestamp_created="1465471929"
timestamp_lastchange="0" />
</manufacturers>
<productgroups>
<productgroup id_productgroup="1"
name="SMARTWATCH"
comment=""
timestamp_created="1465471929"
timestamp_lastchange="0" />
<productgroup id_productgroup="2"
name="FITNESS TRACKER"
comment=""
timestamp_created="1465471929"
timestamp_lastchange="0" />
</productgroups>
<products>
<product id_product="10"
icrt_code="IC16539-0036-00"
modelname="Gear S2"
completename="Samsung Gear S2"
shortname=""
systemmodelid=""
releasedate=""
labreportdate="2016-05-27T00:00:00.000"
labarrivaldate="2016-05-06T00:00:00.000"
boughtbyorganisation="WHICH"
serialnumber="RFAH105HFQF"
articlenumber="8.80608808859E+12"
comment=""
id_productgroup="1"
id_manufacturer="9"
sortorder="0"
batch="1"
labcode=""
parentmodelcode=""
similarmodelscodes=""
testtype=""
picture_lores=""
picture_hires=""
timestamp_created="1465471929"
timestamp_lastchange="1466062628" />
<product id_product="11"
icrt_code="IC16539-0040-00"
modelname="Vivofit 3"
completename="Garmin Vivofit 3"
shortname=""
systemmodelid=""
releasedate=""
labreportdate="2016-06-15T00:00:00.000"
labarrivaldate="2016-06-24T00:00:00.000"
boughtbyorganisation="WHICH"
serialnumber="4R0201708"
articlenumber="53759 15457"
comment=""
id_productgroup="2"
id_manufacturer="3"
sortorder="0"
batch="2"
labcode=""
parentmodelcode=""
similarmodelscodes=""
testtype=""
picture_lores=""
picture_hires=""
timestamp_created="1469800248"
timestamp_lastchange="1475593828" />
<product id_product="12"
icrt_code="IC16539-0047-00"
modelname="Go"
completename="Withings Go"
shortname=""
systemmodelid=""
releasedate=""
labreportdate="2016-06-15T00:00:00.000"
labarrivaldate="2016-06-24T00:00:00.000"
boughtbyorganisation="WHICH"
serialnumber="00:24:E4:39:F0:0D"
articlenumber="700546 701481"
comment=""
id_productgroup="2"
id_manufacturer="10"
sortorder="0"
batch="2"
labcode=""
parentmodelcode=""
similarmodelscodes=""
testtype=""
picture_lores=""
picture_hires=""
timestamp_created="1469800248"
timestamp_lastchange="1475593828" />
</products>
<propertygroups>
<propertygroup id_propertygroup="36"
name="Features|inventory"
comment=""
timestamp_created="1465222484"
timestamp_lastchange="0" />
<propertygroup id_propertygroup="37"
name="Features|Smart"
comment=""
timestamp_created="1465222484"
timestamp_lastchange="0" />
</propertygroups>
<propertys>
<property id_property="381"
id_propertygroup=""
binding="FIRMWARE"
name="Firmware version on device"
comment=""
max="0"
min="0"
unit=""
precision="0"
type="String"
use="1"
testprogram="1.1.3"
timestamp_created="1465222485"
timestamp_lastchange="1465222485" />
<property id_property="382"
id_propertygroup=""
binding="COMPATABILITY"
name="What phones are compatible with device"
comment=""
max="0"
min="0"
unit=""
precision="0"
type="String"
use="1"
testprogram="1.1.7"
timestamp_created="1465222485"
timestamp_lastchange="1468831229" />
</propertys>
<calculationtypes>
<calculationtype id_calculationtype="0"
name="Arithmetic mean calculation" />
<calculationtype id_calculationtype="5"
name="Geometric mean calculation" />
<calculationtype id_calculationtype="1"
name="Versatility calculation" />
<calculationtype id_calculationtype="2"
name="Free formula calculation (complex)" />
<calculationtype id_calculationtype="3"
name="Minimum calculation" />
<calculationtype id_calculationtype="4"
name="Maximum calculation" />
</calculationtypes>
<evaluations>
<evaluation id_evaluation="3165"
id_childs="3185,3199,3176,3166,3180,3175,3195,3615"
id_parent="0"
id_calculationtype="0"
name="total test result"
binding=""
use_inheritna="0"
use_lookuptable="0"
use_limiting="0"
weighting_normalized="0"
weighting_given="1"
lookuptable="0.5,1.5,2.5,3.5,4.5,5.5" unit=""
precision="3"
timestamp_created="1465222499"
timestamp_lastchange="1467972637" />
<evaluation id_evaluation="3166"
id_childs="3167"
id_parent="3165"
id_calculationtype="0"
name="App"
binding=""
use_inheritna="0"
use_lookuptable="0"
use_limiting="0"
weighting_normalized="0"
weighting_given="0"
lookuptable="0.5,1.5,2.5,3.5,4.5,5.5" unit=""
precision="3"
timestamp_created="1465222499"
timestamp_lastchange="1467969418" />
</evaluations>
<results>
<result id_product="1"
id_evaluation="3165"
is_downgrading="0"
downgrading_value="">3.98268146</result>
<result id_product="1"
id_evaluation="100000635"
is_downgrading="0"
downgrading_value="">Provides reminders to stand every hour. You can set progress updates to be given every 4, 6 or 8 hours. Congratulates you when you complete a goal and provides individual feedback and history of activity data. Notifications to focus on specific goals _eg activity__, tells you what percentage of your goal is complete </result>
<result id_product="1"
id_evaluation="100000636"
is_downgrading="0"
downgrading_value="">1</result>
<result id_product="1"
id_evaluation="100000637"
is_downgrading="0"
downgrading_value="">Using the workout app gives you a breakdown of steps, total and active calories and distance covered for that session as well adding these values onto daily accumulated totals</result>
<result id_product="1"
id_evaluation="100000638"
is_downgrading="0"
downgrading_value="">1</result>
</results>
</snapshot>
</snapshots>
</project>
</testresults>
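In real use, however, elem.text was sometimes read incorrectly: the element clearly had a value, yet it came back as None. I spent a long time debugging without finding the cause (a hand-built XML file is never the real thing, so it cannot reproduce every case), and eventually found this passage in the official documentation: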
If you need a fully populated element, look for “end” events instead.
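So that was it: when a start event fires, only the attributes are guaranteed to be present; the text value and the child nodes are not. Switching to end events looked like the obvious fix. But once I handled everything on end events, even the small XML file was read incorrectly... Why? Fortunately this one was easy to debug, and the cause turned out to be simple: I was listening for both start and end, and after a start event I did nothing with the element yet still called elem.clear() on it, so by the time the end event arrived only an empty node was left.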
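The effect is easy to reproduce in isolation. The sketch below runs the same loop twice over a tiny in-memory document (a made-up miniature of the results section; `read_results` is a hypothetical helper): once listening for both events, once for end only. In both runs elem.clear() is called after every delivered event.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature of the <results> section, small enough to be read in one go.
SAMPLE = b"""<results>
  <result id_product="1" id_evaluation="3165">3.98268146</result>
  <result id_product="1" id_evaluation="100000636">1</result>
</results>"""


def read_results(xml_bytes, events):
    """Collect (attributes, text) for each <result> on its 'end' event,
    calling elem.clear() after every event that was requested."""
    collected = []
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=events):
        if event == "end" and elem.tag == "result":
            collected.append((dict(elem.attrib), elem.text))
        elem.clear()  # also runs for 'start' events when they are requested
    return collected


buggy = read_results(SAMPLE, ("start", "end"))
fixed = read_results(SAMPLE, ("end",))
print("start + end:", buggy)
print("end only:   ", fixed)
```

On a document this small, the first run should come back with empty attributes and None for the text, because each element was already wiped on its start event, while the second run returns the real values. With larger inputs the exact timing depends on how much of the file the parser has consumed when each event is delivered, which is why the symptom can look intermittent.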
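So the lesson is: you normally do not need to listen for both start and end. The blogger's code I had followed requested both, which is entirely unnecessary; one is enough unless you have some special need, such as keeping hold of the root node. When reading values, make sure you read them on the end event and that the current node has not been cleared before then.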
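If both events really are needed, one way to stay safe is to use start only for reading attributes and to clear strictly on end. The sketch below assumes a made-up requirement (tagging each result with the id of the enclosing snapshot); `results_with_snapshot` and the sample document are illustrative only.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical cut-down version of the sample file's structure.
SAMPLE = b"""<snapshots>
  <snapshot id_snapshot="4" name="Combined snapshot">
    <results>
      <result id_product="1" id_evaluation="3165">3.98268146</result>
      <result id_product="1" id_evaluation="100000636">1</result>
    </results>
  </snapshot>
</snapshots>"""


def results_with_snapshot(source):
    """Yield (snapshot id, evaluation id, value) for every <result>."""
    current_snapshot = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == "snapshot":
                # Attributes are already defined when a 'start' event is delivered.
                current_snapshot = elem.attrib.get("id_snapshot")
            continue  # never clear an element that has only just been opened
        if elem.tag == "result":
            yield current_snapshot, elem.attrib.get("id_evaluation"), elem.text
        elem.clear()  # reached on 'end' events only


rows = list(results_with_snapshot(io.BytesIO(SAMPLE)))
print(rows)
```

The `continue` in the start branch is the important part: it guarantees that elem.clear() is never reached before the element has been fully read.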
The final working code:
# coding=utf-8
__author__ = 'Arthur'
import mysql.connector  # not used in this snippet
import sys  # not used in this snippet
# cElementTree was removed in Python 3.9; ElementTree picks up the C accelerator automatically
import xml.etree.ElementTree as ET

if __name__ == "__main__":
    # Note: listening for 'end' alone is enough here
    for event, elem in ET.iterparse("test.xml", events=('end',)):
        if elem.tag == 'product' or elem.tag == 'property' or elem.tag == 'evaluation':
            print(elem.attrib)
        elif elem.tag == 'result':
            # Copy the attributes so that adding 'value' does not modify elem.attrib
            a_result = dict(elem.attrib)
            a_result['value'] = elem.text
            if elem.text is None:
                print("result none")
            else:
                print(a_result)
        if elem.tag == 'products':
            print("deal with products over")
        elif elem.tag == 'propertys':
            print("deal with propertys over")
        elif elem.tag == 'evaluations':
            print("deal with evaluations over")
        elif elem.tag == 'results':
            print("deal with results over")
        # Safe here: an 'end' event means the element has been fully read
        elem.clear()
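Researching the new XML parsing approach and rewriting the code took only an hour, but debugging the bug I had introduced then took another hour and a half. Painful.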