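A while ago, while aggregating data into dictionaries, I found that the `in` test becomes very slow when the collection being searched is large: a 300-400 MB input took about an hour to process.
After replacing the list with a set and dropping the `in` checks, the speed improved by roughly 60x, and the job finished in about a minute. The cause is that `uid not in some_list` scans the list element by element, while adding to a set is a hash operation whose cost does not grow with the size of the set. The code is attached below: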
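To make the difference concrete, here is a minimal sketch (not the script from this post) that deduplicates the same synthetic IDs once with a list and once with a set. The function names `dedup_with_list`, `dedup_with_set`, `timed` and the workload size are made up for illustration, and the timings will vary by machine.

```python
import time


def dedup_with_list(uids):
    """Collect unique IDs in a list; every `in` check scans the list (O(n))."""
    seen = []
    for uid in uids:
        if uid not in seen:
            seen.append(uid)
    return seen


def dedup_with_set(uids):
    """Collect unique IDs in a set; add() is a hash operation, O(1) on average."""
    seen = set()
    for uid in uids:
        seen.add(uid)
    return seen


def timed(func, uids):
    """Return (result, elapsed seconds) for a single call of func(uids)."""
    start = time.time()
    result = func(uids)
    return result, time.time() - start


# Hypothetical workload: 3,000 distinct IDs, each appearing twice.
uids = ["user%d" % i for i in range(3000)] * 2

list_result, list_secs = timed(dedup_with_list, uids)
set_result, set_secs = timed(dedup_with_set, uids)

print("list: %d unique in %.4f s" % (len(list_result), list_secs))
print("set:  %d unique in %.4f s" % (len(set_result), set_secs))
```

Both versions end up with the same unique IDs. The gap between the two timings widens as the number of distinct IDs grows, because the list version does work proportional to the list length on every line of input.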
Unoptimized code:
import sys

pvdic = {}
uvdic = {}
day = sys.argv[1]
for line in sys.stdin:
    frags = line.strip().split("\x01")
    if len(frags) == 2 and frags[1].isdigit():
        uid = frags[0]
        dstr = int(frags[1])
        if dstr <= 30:
            diff = "0-30"
        ......
        else:
            diff = "180+"
        # Count page views per bucket
        if diff in pvdic:
            pvdic[diff] += 1
        else:
            pvdic[diff] = 1
        # Collect unique users per bucket
        if diff in uvdic:
            # uvdic[diff] is a list, so this test scans it linearly: the bottleneck
            if uid not in uvdic[diff]:
                uvdic[diff].append(uid)
        else:
            uvdic[diff] = [uid]
difflist = ["0-30", "31-60", "61-90", "91-120", "121-150", "151-180", "180+"]
pvlist = []
uvlist = []
for d in difflist:
    if d in pvdic:
        pvlist.append(str(pvdic[d]))
    else:
        pvlist.append("0")
    if d in uvdic:
        uvlist.append(str(len(uvdic[d])))
    else:
        uvlist.append("0")
print("%s\tpv\t%s" % (day, "\t".join(pvlist)))
...
Optimized code:
import sys

pvdic = {}
uvdic = {}
day = sys.argv[1]
difflist = ["0-30", "31-60", "61-90", "91-120", "121-150", "151-180", "180+"]
# Pre-fill every bucket so the main loop needs no "in" checks
for d in difflist:
    pvdic[d] = 0
    uvdic[d] = set()
for line in sys.stdin:
    frags = line.strip().split("\x01")
    if len(frags) == 2 and frags[1].isdigit():
        uid = frags[0]
        dstr = int(frags[1])
        if dstr <= 30:
            diff = "0-30"
        ......
        else:
            diff = "180+"
        pvdic[diff] += 1
        # set.add() ignores duplicates, so no membership test is needed
        uvdic[diff].add(uid)
pvlist = []
uvlist = []
for d in difflist:
    pvlist.append(str(pvdic[d]))
    uvlist.append(str(len(uvdic[d])))
print("%s\tpv\t%s" % (day, "\t".join(pvlist)))
...
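Since the listings above elide the middle branches and read from stdin, the following is a self-contained sketch of the same set-based approach that can be run as is. It is an illustration under assumptions, not the original script: the bucket boundaries in `UPPER_BOUNDS` are inferred from the labels in `difflist` rather than taken from the elided code, and `bucket_for`, `aggregate` and the sample lines are hypothetical names and data.

```python
import bisect

# Labels copied from difflist in the post.
DIFF_LABELS = ["0-30", "31-60", "61-90", "91-120", "121-150", "151-180", "180+"]
# ASSUMPTION: upper bounds inferred from the labels above,
# not taken from the elided branches of the original script.
UPPER_BOUNDS = [30, 60, 90, 120, 150, 180]


def bucket_for(days):
    """Map a day count to its label; anything above 180 falls into "180+"."""
    return DIFF_LABELS[bisect.bisect_left(UPPER_BOUNDS, days)]


def aggregate(lines):
    """Return (pv, uv): per-bucket line counts and per-bucket sets of user IDs."""
    pv = dict((label, 0) for label in DIFF_LABELS)
    uv = dict((label, set()) for label in DIFF_LABELS)
    for line in lines:
        frags = line.strip().split("\x01")
        if len(frags) == 2 and frags[1].isdigit():
            label = bucket_for(int(frags[1]))
            pv[label] += 1
            uv[label].add(frags[0])  # duplicates are ignored by the set
    return pv, uv


# Hypothetical sample input: "uid\x01days" per line, plus one malformed line.
sample = ["u1\x0110\n", "u1\x0120\n", "u2\x0145\n", "u3\x01200\n", "bad line\n"]
pv, uv = aggregate(sample)
print("pv\t" + "\t".join(str(pv[label]) for label in DIFF_LABELS))
print("uv\t" + "\t".join(str(len(uv[label])) for label in DIFF_LABELS))
```

Passing `sys.stdin` to `aggregate` instead of the sample list would process real input the same way, since the function only needs an iterable of lines.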