Python: How do I efficiently check whether an item is in a list?


I have a list of strings (word-like tokens), and while parsing a text I need to check whether each word belongs to that list.

However, my input is pretty big (about 600 million lines), and checking whether an element belongs to a list is an O(n) operation according to the Python documentation.

My code is something like:

words_in_line = []
for word in line:
    if word in my_list:
        words_in_line.append(word)

As it takes too much time (days, actually), I want to improve this part, which accounts for most of the running time. I had a look at the Python collections module and, more precisely, at deque. However, it only gives O(1) access to the head and the tail of the sequence, not to the middle.

Does anyone have an idea how to do this in a better way?

Solution

You might consider a trie, a DAWG, or a database. There are several Python implementations of each.

Here are some relative timings of a set versus a list for you to consider:
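To make the trie suggestion concrete, here is a minimal sketch of a dictionary-based trie that supports exact membership tests. The class name `Trie`, the `_END` sentinel, and the sample words are all hypothetical, chosen only for illustration; this is not one of the existing implementations mentioned above.

```python
class Trie:
    """Minimal dict-based trie for exact word membership tests (illustrative only)."""

    _END = object()  # sentinel key marking the end of a stored word

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        # Walk down the tree one character at a time, creating nodes as needed.
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def __contains__(self, word):
        # Follow the characters; fail as soon as a branch is missing.
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return self._END in node


# Hypothetical sample data
vocab = Trie(["car", "cart", "dog"])
line_words = ["the", "cart", "hit", "a", "dog", "ca"]
found = [w for w in line_words if w in vocab]
print(found)  # ['cart', 'dog']
```

A lookup here costs time proportional to the length of the word rather than the size of the vocabulary. For plain exact-match membership in pure Python, a built-in set is usually simpler and likely faster. A trie becomes more interesting when prefix queries are also needed.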

import random
import timeit

# UNIX word list, about 250k unique words
with open('/usr/share/dict/words', 'r') as di:
    all_words_set = {line.strip() for line in di}

all_words_list = list(all_words_set)  # slightly faster if this list is sorted...

test_list = [random.choice(all_words_list) for i in range(10000)]
test_set = set(test_list)

def set_f():
    # use set for source, set for membership testing
    count = 0
    for word in test_set:
        if word in all_words_set:
            count += 1
    return count

def list_f():
    # use list for source, list for membership testing
    count = 0
    for word in test_list:
        if word in all_words_list:
            count += 1
    return count

def mix_f():
    # use list for source, set for membership testing
    count = 0
    for word in test_list:
        if word in all_words_set:
            count += 1
    return count

print("list:", timeit.Timer(list_f).timeit(1), "secs")
print("set:", timeit.Timer(set_f).timeit(1), "secs")
print("mixed:", timeit.Timer(mix_f).timeit(1), "secs")

Prints:

list: 47.4126560688 secs
set: 0.00277495384216 secs
mixed: 0.00166988372803 secs

That is, matching a set of 10,000 words against a set of 250,000 words is about 17,085 times faster than matching a list of the same 10,000 words against a list of the same 250,000 words. Using a list for the source and a set for membership testing is about 28,392 times faster than an unsorted list alone.

For membership testing, a list is O(n), while sets and dicts are O(1) on average for lookups.
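Applied to the loop in the question, the change amounts to building a set once before parsing and testing against it. The following is a minimal sketch. The helper name `words_in_vocabulary`, the sample vocabulary, and splitting each line on whitespace are assumptions made for illustration, not details given in the question.

```python
def words_in_vocabulary(line_words, vocabulary):
    """Return the words from line_words that appear in vocabulary (a set)."""
    return [word for word in line_words if word in vocabulary]


# Hypothetical sample data standing in for the real word list and input
my_list = ["apple", "banana", "cherry"]
my_set = frozenset(my_list)  # built once, before any line is parsed

line = "I ate an apple and a cherry today"
words_in_line = words_in_vocabulary(line.split(), my_set)
print(words_in_line)  # ['apple', 'cherry']
```

The set is built a single time, so its construction cost is paid once, while each of the many per-word lookups becomes a hash lookup instead of a scan of the whole list.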

Conclusion: Use better data structures for 600 million lines of text!
