Python: how to efficiently check whether an item is in a list?

Latest recommended article published on 2023-06-11 02:39:17

weixin_39599830


Article link: https://blog.csdn.net/weixin_39599830/article/details/112993126


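Summary (generated automatically by CSDN): This article discusses how to reduce lookup time when checking words against a word list while processing a very large text input (about 600 million lines). The answer suggests considering a trie, a DAWG, or a database, all of which have Python implementations, and shows with timings that membership tests against a set are dramatically faster than membership tests against a list.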

I have a list of strings (word-like tokens), and while I am parsing a text I need to check whether each word belongs to that list.

However, my input is pretty big (about 600 million lines), and checking whether an element belongs to a list is an O(n) operation according to the Python documentation.

My code is something like:

    words_in_line = []
    for word in line:
        if word in my_list:
            words_in_line.append(word)

As it takes too much time (days, actually), I want to improve this part, which accounts for most of the running time. I had a look at the Python collections module and, more precisely, at deque. However, a deque only gives O(1) access to the head and the tail of the sequence, not to the middle.

Does anyone have an idea about how to do this in a better way?

Solution

You might consider a trie, a DAWG, or a database. There are several Python implementations of each.

Here are some relative timings for you to consider, comparing a set with a list:
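To make the trie suggestion concrete, here is a minimal sketch of a dict-based trie that supports exact-word membership tests. The `Trie` class and its method names are hypothetical, written for illustration only; they are not taken from the answer or from any particular library.

```python
class Trie:
    """Minimal dict-based trie for exact-word membership tests (illustrative)."""

    _END = object()  # sentinel key marking the end of a stored word

    def __init__(self, words=()):
        self._root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        # Walk down the tree, creating one nested dict per character.
        node = self._root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def __contains__(self, word):
        # Follow the characters; fail as soon as a branch is missing.
        node = self._root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        # A prefix of a stored word is not itself a member.
        return self._END in node


trie = Trie(["car", "card", "care"])
print("card" in trie)  # True
print("ca" in trie)    # False: only a prefix of stored words
```

A lookup costs time proportional to the length of the word, independent of how many words are stored. For plain exact-match membership, a built-in set is likely to be both simpler and faster in pure Python; a trie tends to pay off when prefix queries are also needed.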

    import random
    import timeit

    # UNIX word list, about 250k unique words
    with open('/usr/share/dict/words', 'r') as di:
        all_words_set = {line.strip() for line in di}

    all_words_list = list(all_words_set)  # slightly faster if this list is sorted...
    test_list = [random.choice(all_words_list) for i in range(10000)]
    test_set = set(test_list)

    def set_f():
        # use set for source, set for membership testing
        count = 0
        for word in test_set:
            if word in all_words_set:
                count += 1
        return count

    def list_f():
        # use list for source, list for membership testing
        count = 0
        for word in test_list:
            if word in all_words_list:
                count += 1
        return count

    def mix_f():
        # use list for source, set for membership testing
        count = 0
        for word in test_list:
            if word in all_words_set:
                count += 1
        return count

    # Run each function once and report the elapsed time in seconds
    print("list:", timeit.Timer(list_f).timeit(1), "secs")
    print("set:", timeit.Timer(set_f).timeit(1), "secs")
    print("mixed:", timeit.Timer(mix_f).timeit(1), "secs")

Prints:

    list: 47.4126560688 secs
    set: 0.00277495384216 secs
    mixed: 0.00166988372803 secs

That is, matching a set of 10,000 words against a set of 250,000 words is 17,085 times faster than matching a list of the same 10,000 words against a list of the same 250,000 words. Using a list for the source and a set for membership testing is 28,392 times faster than using an unsorted list alone.

For membership testing, a list is O(n), while sets and dicts are O(1) on average for lookups.
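Applied to the loop from the question, the change amounts to building a set from the word list once and testing against that set. The sketch below assumes that words are separated by whitespace; the helper name `find_known_words` and the sample data are hypothetical.

```python
def find_known_words(line, vocabulary):
    """Return the whitespace-separated words of `line` found in `vocabulary`."""
    return [word for word in line.split() if word in vocabulary]


my_list = ["apple", "banana", "cherry"]  # hypothetical word list
my_set = set(my_list)                    # built once, before the main loop

matches = find_known_words("I ate an apple and a cherry today", my_set)
print(matches)  # ['apple', 'cherry']
```

The set is built a single time up front, so each of the many per-word checks afterwards is an average O(1) hash lookup instead of an O(n) scan of the list.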

Conclusion: Use better data structures for 600 million lines of text!
