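Previous post: http://blog.csdn.net/mmc2015/article/details/50988375 (Mining DBLP co-authorship, FP-Growth in practice (1): extracting target information (conferences, authors, etc.) from the DBLP dataset)
Readers reported that the code was unusable, mainly because it was far too slow. Fair enough, I admit it is slow: it builds the whole tree in memory, so of course it is!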
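To make the "tree in memory" point concrete, here is a minimal sketch that reads the same records two ways. The SAMPLE string is a made-up miniature stand-in for dblp.xml, and years_dom / years_stream are hypothetical helper names. ElementTree's iterparse is used only as one illustration of streaming; it is not necessarily one of the two approaches presented in this post. The sketch runs on both Python 2 and Python 3.

```python
import io
import xml.etree.ElementTree as ET
from xml.dom.minidom import parseString

# Hypothetical miniature stand-in for dblp.xml (not real DBLP data).
SAMPLE = (
    b"<dblp>"
    b"<inproceedings><year>2014</year><booktitle>ICML</booktitle></inproceedings>"
    b"<inproceedings><year>2015</year><booktitle>KDD</booktitle></inproceedings>"
    b"</dblp>"
)


def years_dom(xml_bytes):
    """DOM: the whole document is parsed into a tree before anything is read."""
    dom = parseString(xml_bytes)
    records = dom.documentElement.getElementsByTagName("inproceedings")
    return [str(r.getElementsByTagName("year")[0].childNodes[0].data)
            for r in records]


def years_stream(xml_bytes):
    """Streaming: handle each record when its end tag arrives, then empty it."""
    years = []
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "inproceedings":
            years.append(elem.findtext("year"))
            # Drop the record's children and text so they can be freed;
            # the emptied element itself stays attached to the root.
            elem.clear()
    return years
```

Both functions return the same list of years for the sample. The difference is that the DOM version must hold every record at once, while the streaming version only needs the contents of the record it is currently handling.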
This time I present two other approaches.
For completeness, here is the DOM version first:
# Do not use this code! (Python 2; the DOM builds the whole tree in memory)
from xml.dom.minidom import parse

# fileName, fromYear and confNameDict are expected to be module-level globals.
def DomParser():
    domTree = parse(fileName)
    dblp = domTree.documentElement
    inproceedingsList = dblp.getElementsByTagName("inproceedings")
    for inproceedings in inproceedingsList:
        year = inproceedings.getElementsByTagName("year")[0]
        yearStr = str(year.childNodes[0].data)
        # String comparison: only correct if fromYear is a 4-digit year string.
        if yearStr < fromYear:
            continue
        print "yearStr", yearStr, "==" * 20
        booktitle = inproceedings.getElementsByTagName("booktitle")[0]
        booktitleStr = str(booktitle.childNodes[0].data)
        # e.g. "<booktitle>ICML Unsupervised and Transfer Learning</booktitle>":
        # keep only the first word ("ICML") as the conference name
        booktitleStr = booktitleStr.split(" ")[0]
        if booktitleStr not in confNameDict:
            continue
        print "booktitleStr", booktitleStr, "^^" * 20
#allList=[] #"confName \t year \t tit