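I have two very large XML files that hold different data for the same place/building/room combinations. I currently run Python etree parse on the first large file, then loop over it to extract the place/building/room ids (plus other information), and use those ids to walk through the second large XML file (same structure as the first). For the second file I am using lxml iterparse to find and extract the Place element that corresponds to the current place from the first file. It then iterates over that Place element to find the relevant data. This works, but it keeps getting slower the deeper I loop into the first file.
I have done my best to clear irrelevant elements during the iterparse of the second large file, but I have 5000 places to loop over: the first 100 are processed very quickly (under a minute), the next 400 take 30 minutes, and so on. After 15 hours I am at around 4000 facilities and moving very slowly. I suspect the parse of one of the files is holding on to too much data.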
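To make the slowdown pattern concrete, here is a minimal sketch (not the real code) of what happens if every lookup restarts the scan of the second file from the top. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and make_xml, scan_for_place and the Places root tag are made-up names for illustration only. Whether this is the actual cause in my case is an assumption, not something I have confirmed.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml in this sketch


def make_xml(n):
    """Build a small synthetic file with n Place elements (hypothetical data)."""
    places = "".join(
        f"<Place><PlaceIdentification><PlaceIdentifier>P{i}</PlaceIdentifier>"
        f"</PlaceIdentification></Place>"
        for i in range(n)
    )
    return f"<Places>{places}</Places>".encode()


def scan_for_place(xml_bytes, wanted_id):
    """Restart the parse from the top; return how many Place elements were read."""
    scanned = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag != "Place":
            continue
        scanned += 1
        if elem.findtext("PlaceIdentification/PlaceIdentifier") == wanted_id:
            return scanned
        elem.clear()  # drop the children of a Place we do not need
    return scanned


xml_bytes = make_xml(200)
# One lookup per place, in file order, each starting again at the first Place.
costs = [scan_for_place(xml_bytes, f"P{i}") for i in range(200)]
total_scanned = sum(costs)
print(costs[0], costs[-1], total_scanned)  # 1 200 20100
```

In this toy run the k-th lookup reads k Place elements, so the total work grows roughly with the square of the number of places, which would match lookups getting slower the further in they are.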
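Below is simplified code using generalized XML (sorry, I couldn't simplify it any further).

import csv
from lxml import etree as ET  # iterparse(tag=...) and getprevious() are lxml features

largefile1 = "largefile1.xml"
largefile2 = "largeFile2.xml"
ptree = ET.parse(largefile1)
proot = ptree.getroot()
o = open('output.txt', 'w', newline='')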
def get_place_elem(pplaceid, largefile2):
    # Stream largefile2 and return the Place element whose id matches pplaceid.
    Placenode = ET.iterparse(largefile2, events=("end",), tag='Place')
    for event, Place in Placenode:
        for PlaceId in Place.findall('PlaceIdentification'):
            placeid = PlaceId.find('PlaceIdentifier').text
            if placeid == pplaceid:
                del Placenode
                return Place
        # Not the place we want: free it and any already-processed siblings.
        Place.clear()
        while Place.getprevious() is not None:
            del Place.getparent()[0]
    del Placenode
def getfacdata(pplaceid, pbuildid, proomid, Place):
    # Find the matching Building and Room inside Place and return that room's data.
    for Build in Place.findall('Building'):
        euid = ' '
        for BuildId in Build.findall('BuildingIdentification'):
            bid = BuildId.find('Identifier').text
            if bid == pbuildid:
                for Room in Build.findall('Room'):
                    roomid = ' '
                    for RoomId in Room.findall('RoomIdentification'):
                        roomid = RoomId.find('Identifier').text
                        if roomid == proomid:
                            # ... collect data from the Room element ...
                            # ... do some simple math with if statements ...
                            return data  # list of 15 data values
placecnt = 0
prevpplaceid = None
writer = csv.writer(o)  # create the writer once instead of once per row
for pPlace in proot.findall('.//Place'):
    for pPlaceId in pPlace.findall('PlaceIdentification'):
        pplaceid = pPlaceId.find('PlaceIdentifier').text
        placecnt += 1  # count the places seen so far
        # ... get some data
        for pBuild in pPlace.findall('Building'):
            for pBuildId in pBuild.findall('BuildingIdentification'):
                pbid = pBuildId.find('Identifier').text
                for pRoom in pBuild.findall('Room'):
                    for pRoomId in pRoom.findall('RoomIdentification'):
                        proomid = pRoomId.find('Identifier').text
                        if prevpplaceid != pplaceid:
                            # New place: drop the old Place and rescan largefile2.
                            if placecnt != 1:
                                Place.clear()
                            Place = get_place_elem(pplaceid, largefile2)
                            prevpplaceid = pplaceid
                        data = getfacdata(pplaceid, pbid, proomid, Place)
                        # ... collect data from the Room element ...
                        # ... do some simple math with if statements ...
                        writer.writerow((
                            # data from pRoom and from the 'data' list built from largefile2, in csv format
                        ))
                        break
        prevpplaceid = pplaceid
o.close()
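One restructuring that would avoid reopening the second file for every place is to stream it once and keep only the values needed, keyed by the three ids. The following is a rough sketch under assumptions, not a tested replacement for the code above: it uses the standard library's xml.etree.ElementTree instead of lxml, index_rooms and SECOND_FILE are hypothetical names, and the Area field and the Places root tag are invented for the example.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in; the code above uses lxml


def index_rooms(source):
    """Stream the file once and map (place, building, room) ids to a Room value."""
    index = {}
    for _event, place in ET.iterparse(source, events=("end",)):
        if place.tag != "Place":
            continue
        place_id = place.findtext("PlaceIdentification/PlaceIdentifier")
        for build in place.findall("Building"):
            build_id = build.findtext("BuildingIdentification/Identifier")
            for room in build.findall("Room"):
                room_id = room.findtext("RoomIdentification/Identifier")
                # "Area" is a hypothetical data field. Store plain values,
                # not elements, so the parsed subtree can be released.
                index[(place_id, build_id, room_id)] = room.findtext("Area")
        place.clear()  # drop this Place's children once it has been indexed
    return index


# Tiny stand-in for the second file (assumed layout).
SECOND_FILE = b"""<Places>
  <Place>
    <PlaceIdentification><PlaceIdentifier>P1</PlaceIdentifier></PlaceIdentification>
    <Building>
      <BuildingIdentification><Identifier>B1</Identifier></BuildingIdentification>
      <Room><RoomIdentification><Identifier>R1</Identifier></RoomIdentification><Area>12</Area></Room>
      <Room><RoomIdentification><Identifier>R2</Identifier></RoomIdentification><Area>30</Area></Room>
    </Building>
  </Place>
  <Place>
    <PlaceIdentification><PlaceIdentifier>P2</PlaceIdentifier></PlaceIdentification>
    <Building>
      <BuildingIdentification><Identifier>B1</Identifier></BuildingIdentification>
      <Room><RoomIdentification><Identifier>R1</Identifier></RoomIdentification><Area>7</Area></Room>
    </Building>
  </Place>
</Places>"""

room_index = index_rooms(io.BytesIO(SECOND_FILE))
print(room_index[("P2", "B1", "R1")])  # 7
```

With an index like this, the loop over the first file would do a dictionary lookup per room instead of a fresh scan per place. The trade-off is that the extracted values for every room have to fit in memory at once.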
Generic XML
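For reference, the element layout that the code above relies on can be sketched as follows. The element names are taken from the find/findall calls in the code; the Places root tag, the id values, and the absence of any other fields are assumptions, and this is not the actual sample from my files.

```python
import xml.etree.ElementTree as ET

# Layout inferred from the element names used in the code; values are made up.
SAMPLE_XML = """\
<Places>
  <Place>
    <PlaceIdentification>
      <PlaceIdentifier>P1</PlaceIdentifier>
    </PlaceIdentification>
    <Building>
      <BuildingIdentification>
        <Identifier>B1</Identifier>
      </BuildingIdentification>
      <Room>
        <RoomIdentification>
          <Identifier>R1</Identifier>
        </RoomIdentification>
      </Room>
    </Building>
  </Place>
</Places>
"""

root = ET.fromstring(SAMPLE_XML)
# Walk Place -> Building -> Room and collect the three ids for each room.
ids = [
    (
        place.findtext("PlaceIdentification/PlaceIdentifier"),
        build.findtext("BuildingIdentification/Identifier"),
        room.findtext("RoomIdentification/Identifier"),
    )
    for place in root.findall("Place")
    for build in place.findall("Building")
    for room in build.findall("Room")
]
print(ids)  # [('P1', 'B1', 'R1')]
```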