for and while loops in pure Python are comparatively slow, because every iteration runs through the interpreter. When a loop only applies a function across a sequence, prefer map(), reduce(), and filter(); their iteration happens in C, so they are usually faster:
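As a rough illustration (numbers vary by machine and Python version, and the advantage largely disappears when map() has to call a Python lambda), a micro-benchmark might look like this:
import timeit

# Illustrative micro-benchmark: the gain is clearest when map() is given a
# C-implemented function such as str; with a Python lambda the per-call
# overhead can cancel it out.
comp_time = timeit.timeit('[str(i) for i in range(1000)]', number=1000)
map_time = timeit.timeit('list(map(str, range(1000)))', number=1000)
print('comprehension: %.3fs  map: %.3fs' % (comp_time, map_time))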
The map function
- map() takes a function and an iterable. In Python 3 it returns a lazy map object; in Python 2 it returns a list:
b = map(lambda x: print("Happy Mid-Autumn Festival %s" % x), [1, 2, 3])
list(b)  # map is lazy in Python 3: the prints only run once the object is consumed
b3 = map(lambda x: x * 2, range(10))  # equivalent to [i * 2 for i in range(10)]
print(list(b3))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
The reduce function
- In Python 3, reduce() must be imported with from functools import reduce; in Python 2 it is a built-in. It folds a sequence into one value by applying the function cumulatively from left to right:
from functools import reduce
from typing import List  # used by the Trie example below

n = [1, 2, 3, 4, 5]
n5 = reduce(lambda x, y: x * y, n)  # 5! = 120
print(n5)
a = reduce(lambda x, y: x * y, [1], 2)  # the third argument is the initial value: 2 * 1 = 2
b = reduce(lambda x, y: x * y, [2, 2, 3], 3)  # 3 * 2 * 2 * 3 = 36
c = reduce(lambda x, y: x - y, [1, 2, 3, 5, 6], 0)  # 0 - 1 - 2 - 3 - 5 - 6 = -17
d = reduce(lambda x, y: x * y, [], 5)  # an empty sequence just returns the initial value: 5
print(a, b, c, d)  # 2 36 -17 5
from collections import defaultdict

dic = defaultdict(int)
dic[0], dic[2] = 1, 3
print(defaultdict.__getitem__(dic, 5))  # output: 0 (a missing key calls the factory, inserts the default, and returns it)
# reduce can also be used to build a trie:
class Trie:
    def __init__(self, words: List[str]):
        add_child = lambda: defaultdict(add_child)  # each missing key spawns another nested defaultdict
        self.trie = defaultdict(add_child)
        for word in words:
            # walk (and auto-create) one node per character; "$" marks a complete word
            reduce(dict.__getitem__, word, self.trie)["$"] = True
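A quick sanity check of the structure (the word list below is just for illustration):
t = Trie(["cat", "car"])
print("$" in reduce(dict.__getitem__, "cat", t.trie))  # True: "cat" is a complete word
print("$" in reduce(dict.__getitem__, "ca", t.trie))   # False: "ca" is only a prefix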
The filter function
- filter() is a sieve: it keeps only the elements for which the predicate returns True. For example:
b1 = filter(lambda x: x > 6 and x < 8, range(12))  # equivalent to [i for i in range(12) if 6 < i < 8]
print(list(b1))  # [7] (like map, filter is lazy in Python 3)
The sum function
- sum() accepts a start value as its second argument, so with [] as the start it can flatten a list of lists:
splits = [['a'], ['b'], ['c']]
print(sum(splits, []))  # ['a', 'b', 'c']
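Be aware that sum() rebuilds the accumulator list at every step, so flattening this way is quadratic in the total length; itertools.chain.from_iterable is the usual linear-time alternative:
from itertools import chain

splits = [['a'], ['b'], ['c']]
print(list(chain.from_iterable(splits)))  # ['a', 'b', 'c']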
Longest-match Chinese tokenization
A trie supports Chinese word segmentation by greedy longest match:
from collections import defaultdict
from functools import reduce

class Tokenize:
    def __init__(self, words):
        add_child = lambda: defaultdict(add_child)
        self.trie = defaultdict(add_child)
        for word in words:
            reduce(defaultdict.__getitem__, word, self.trie)['$'] = True

    def ch_tokenized(self, sentence):
        res = []
        sentence_len = len(sentence)
        # start_pos: where this round of matching starts;
        # last_pos: end of the last segment already appended to res
        start_pos, last_pos = 0, 0
        while start_pos < sentence_len:
            cur_node = self.trie
            off = start_pos
            match_end = start_pos  # end of the longest dictionary word found here
            while off < sentence_len and sentence[off] in cur_node:
                cur_node = cur_node[sentence[off]]
                off += 1
                if '$' in cur_node:
                    match_end = off  # remember every complete word; the last one is the longest
            if match_end > start_pos:
                if start_pos > last_pos:
                    res.append(sentence[last_pos:start_pos])  # flush the unmatched run before this word
                res.append(sentence[start_pos:match_end])
                start_pos = match_end
                last_pos = match_end
            else:
                start_pos += 1  # no word starts here; advance one character
        if start_pos > last_pos:
            res.append(sentence[last_pos:])  # trailing unmatched run
        return res
tokenize = Tokenize(['张继科', '渣男'])
print(tokenize.ch_tokenized('张继科是渣男啊'))  # ['张继科', '是', '渣男', '啊']
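Because the matcher records the longest dictionary word seen along the trie path, a longer entry wins over a shorter prefix; a small sketch with a made-up word list:
t2 = Tokenize(['北京', '北京大学'])
print(t2.ch_tokenized('北京大学生'))  # ['北京大学', '生']: '北京大学' beats the shorter prefix '北京'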