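Today I came across an interview question about word matching and decided to try it myself.
I found a file of common words online.
Then I thought about possible implementations:
- Linear scan
- Binary search
- Build a word tree
I will also compare these against Python's built-in set.
First, a function to measure elapsed time: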
def time_clock(func):
    import time
    import functools
    @functools.wraps(func)
    def _(*args, **kwargs):
        before = time.time()
        result = func(*args, **kwargs)
        print 'duration = %s' % (time.time() - before)
        return result
    return _
Next, read the file into a list:
with open('words.txt') as f:
    words = [line.strip('\n').lower() for line in f]
A single lookup is not representative, so we look up every element of the list and time the whole run:
@time_clock
def match_all(func):
    import sys
    # Write the name without a newline so the duration prints on the same line.
    sys.stdout.write('func_name:%s ' % (func.__name__).ljust(18))
    for word in words:
        if not func(word):
            print word
            raise Exception("don't match")
Linear scan
def match_word_for(word):
    for w in words:
        if w == word:
            return True
    return False
This is the simplest and laziest approach, and its efficiency is dreadful.
Binary search
def match_word_middle(word):
    start = 0
    end = len(words) - 1
    while start <= end:
        # Floor division keeps the index an integer.
        middle_index = (start + end) // 2
        middle = words[middle_index]
        if word == middle:
            return True
        elif word < middle:
            end = middle_index - 1
        else:
            start = middle_index + 1
    return False
Binary search requires the list to be sorted first:
words.sort()
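The standard library's bisect module performs the same halving search on a sorted list. A minimal sketch, assuming a small hypothetical list `sample_words` in place of the contents of words.txt; `match_word_bisect` is an illustrative name and not part of the code timed in this post:

```python
import bisect

# Hypothetical sample data standing in for the contents of words.txt.
sample_words = sorted(['pear', 'apple', 'fig', 'banana', 'cherry'])

def match_word_bisect(word, sorted_words=sample_words):
    """Return True if word is in sorted_words, using binary search."""
    # bisect_left returns the leftmost position where word could be inserted
    # while keeping the list sorted; word is present only if it already sits there.
    index = bisect.bisect_left(sorted_words, word)
    return index < len(sorted_words) and sorted_words[index] == word
```

As with the hand-written version, the list must already be sorted, otherwise the result is meaningless.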
Word tree
First, build the word tree:
class Char_node(object):
    def __init__(self, char=0):
        self.char = char
        self.children = {}

    def find_char(self, char):
        return self.children.get(char, None)

    def add_char(self, char):
        # Reuse the existing child for this character, or create one.
        node = self.children.get(char, None)
        if not node:
            node = Char_node(char)
            self.children[char] = node
        return node

    def __repr__(self):
        return "<Char_node : %s>" % self.char

tree_root = Char_node(0)

def add_word(word):
    node = tree_root
    for char in word:
        node = node.add_char(char)
    # A '$' child marks the end of a complete word.
    node.add_char('$')

def build_char_tree(words):
    for word in words:
        add_word(word)
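Then comes the lookup in the word tree. I used Python's dict for the child nodes here, although strictly speaking I should have implemented that data structure myself.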
def match_word_tree(word):
    node = tree_root
    for i in range(len(word)):
        node = node.find_char(word[i])
        if not node:
            return False
        if i == len(word) - 1 and node.children.get('$'):
            return True
    return False
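Since the child table is already a dict, the whole tree can also be sketched with plain nested dicts and no node class. This is an illustrative variant under that assumption, not the implementation timed in this post; `END`, `build_trie` and `trie_contains` are names made up for the example, and it assumes no word contains the '$' character:

```python
END = '$'  # end-of-word marker, same role as the '$' child in the Char_node tree

def build_trie(word_list):
    """Build a nested-dict trie: each key is a character, each value a sub-trie."""
    root = {}
    for word in word_list:
        node = root
        for char in word:
            # setdefault returns the existing child or creates an empty one.
            node = node.setdefault(char, {})
        node[END] = True  # mark that a complete word ends at this node
    return root

def trie_contains(root, word):
    """Return True only if word was inserted as a whole word (assumes no '$' in word)."""
    node = root
    for char in word:
        node = node.get(char)
        if node is None:
            return False
    return END in node
```

A prefix such as 'ca' is not reported as a match unless it was inserted as a word itself, because only nodes carrying the end marker count.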
Let's compare their speed:
if __name__ == '__main__':
    build_char_tree(words)
    words.sort()
    match_all(match_word_for)
    match_all(match_word_middle)
    match_all(match_word_tree)
func_name:match_word_for duration = 0.192780017853
func_name:match_word_middle duration = 0.0136761665344
func_name:match_word_tree duration = 0.0193219184875
[Finished in 0.3s]
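The linear scan really is unbearably slow!
Binary search, however, turned out to be faster than the word tree, which surprised me.
A word-tree lookup takes only as many steps as the word has characters, while binary search takes O(log2 n) comparisons. My word list has 2000 entries, so in theory the word tree should be faster.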
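The step counts behind this comparison can be checked with a little arithmetic. A rough sketch, assuming a sorted list of 2000 distinct integers as a stand-in for the word list (the real words are not needed to count probes); `count_probes` is a hypothetical helper that mirrors the loop in match_word_middle:

```python
import math

def count_probes(sorted_items, target):
    """Count how many middle elements binary search inspects for target."""
    start, end, probes = 0, len(sorted_items) - 1, 0
    while start <= end:
        middle_index = (start + end) // 2
        probes += 1
        if target == sorted_items[middle_index]:
            return probes
        elif target < sorted_items[middle_index]:
            end = middle_index - 1
        else:
            start = middle_index + 1
    return probes

items = list(range(2000))  # stand-in for 2000 sorted words
worst_case = max(count_probes(items, x) for x in items)
average = sum(count_probes(items, x) for x in items) / float(len(items))
bound = int(math.floor(math.log(len(items), 2))) + 1  # floor(log2(n)) + 1
```

On this stand-in, worst_case and bound are both 11 and average is just under 10. That is not far from the length of a typical English word, and each tree step costs a method call plus a dict lookup while each binary-search step is a single string comparison, which may be why the tree does not come out ahead here.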
Now let's look at lookups with Python's built-in data structures:
- set
- list
set
words_set = set(words)

def match_word_set(word):
    return word in words_set
list
def match_word_list(word):
    return word in words
Here is the full comparison:
if __name__ == '__main__':
    build_char_tree(words)
    words.sort()
    match_all(match_word_set)
    match_all(match_word_list)
    match_all(match_word_for)
    match_all(match_word_middle)
    match_all(match_word_tree)
func_name:match_word_set duration = 0.000779867172241
func_name:match_word_list duration = 0.0541019439697
func_name:match_word_for duration = 0.183423042297
func_name:match_word_middle duration = 0.0140800476074
func_name:match_word_tree duration = 0.0185759067535
[Finished in 0.4s]
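set is outrageously fast. It is implemented as a hash table, so a lookup does not have to scan the elements.
list is considerably slower, but still much faster than the hand-written loop, presumably because the in operator runs the same scan inside the interpreter rather than in Python code.
Summary: to test whether an element is in a collection, use a set. Its speed is not even in the same order of magnitude as a list.
Still, I think the word tree remains a good choice when the word list is very, very large.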
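The gap between set and list membership described in the summary can be reproduced without the word file by using the standard timeit module. A small sketch, assuming 2000 synthetic strings in place of the real word list; `time_membership` is a hypothetical helper, the code is written to run under both Python 2 and 3, and absolute timings will differ by machine:

```python
import timeit

# Synthetic stand-in for the word list: 2000 distinct strings.
sample = ['word%04d' % i for i in range(2000)]
sample_set = set(sample)

def time_membership(container, repeat=5):
    """Return the best time in seconds to look up every sample item in container."""
    def lookup_all():
        for item in sample:
            if item not in container:
                raise Exception("don't match")
    # Each run calls lookup_all once; taking the minimum of several runs reduces noise.
    return min(timeit.repeat(lookup_all, number=1, repeat=repeat))

list_seconds = time_membership(sample)
set_seconds = time_membership(sample_set)
print('list: %.6f s, set: %.6f s' % (list_seconds, set_seconds))
```

Taking the minimum over several runs is a common way to reduce interference from other processes, which a single time.time() measurement cannot do.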