map(), reduce(), filter(), and sum() in Python, plus a short Trie-based word segmenter

taoqick

Last modified 2024-04-24 21:24:11

First published 2021-01-04 00:13:24

Article link: https://blog.csdn.net/taoqick/article/details/112157779

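Explicit for and while loops in Python are comparatively slow, because every iteration is executed by the interpreter. When a loop only transforms, filters, or folds a sequence, consider map(), reduce(), and filter() instead. They are concise and can run faster, especially when they are given a built-in function rather than a lambda: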
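
As a minimal sketch of the idea, the snippet below performs the same three steps (transform, filter, fold) once with an explicit loop and once with map(), filter(), and reduce(). The data and variable names are made up for illustration. Whether the functional form is actually faster depends on the workload and is best measured, for example with timeit.

```python
from functools import reduce

numbers = [3, 8, 1, 6]  # illustrative data, not from the article

# Explicit loop: square every item, keep the even squares, add them up.
loop_total = 0
for n in numbers:
    sq = n * n
    if sq % 2 == 0:
        loop_total += sq

# The same pipeline with map(), filter(), and reduce().
squares = map(lambda n: n * n, numbers)
even_squares = filter(lambda s: s % 2 == 0, squares)
func_total = reduce(lambda acc, s: acc + s, even_squares, 0)

print(loop_total, func_total)  # 100 100
```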

map function

map() takes a function and one or more iterables. In Python 3 it returns a lazy map object (an iterator); in Python 2 it returns a list:

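b = map(lambda x: print("Happy Mid-Autumn Festival %s" % x), [1, 2, 3])  # lazy in Python 3: nothing is printed yet
list(b)  # consuming the iterator triggers the three print calls
b3 = map(lambda x: x * 2, range(10))  # equivalent to [i * 2 for i in range(10)]
print(list(b3))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]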
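
Two properties of the Python 3 map object are easy to overlook, so here is a small sketch of both: map() accepts several iterables at once, and the resulting iterator can be consumed only once. The variable names are illustrative.

```python
# map() accepts several iterables and stops at the shortest one.
sums = map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30, 40])

first_pass = list(sums)   # [11, 22, 33]
second_pass = list(sums)  # [] because the map object is already exhausted

print(first_pass, second_pass)
```

If the result is needed more than once, it is usually simplest to store list(map(...)) right away.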

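reduce function

In Python 3, reduce() has to be imported with from functools import reduce; in Python 2 it is a built-in. It applies a two-argument function cumulatively to the items of a sequence, folding them into a single value: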

from functools import reduce
from typing import List

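n = [1, 2, 3, 4, 5]
n5 = reduce(lambda x, y: x * y, n)  # product of 1..5, i.e. 5! = 120
print(n5)

a = reduce(lambda x, y: x * y, [1], 2)  # the third argument is the initial value: 2 * 1
b = reduce(lambda x, y: x * y, [2, 2, 3], 3)  # 3 * 2 * 2 * 3
c = reduce(lambda x, y: x - y, [1, 2, 3, 5, 6], 0)  # 0 - 1 - 2 - 3 - 5 - 6
d = reduce(lambda x, y: x * y, [], 5)  # empty sequence: the initial value is returned
print(a, b, c, d)  # 2 36 -17 5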
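
To make the folding order explicit, the sketch below spells out what reduce() does with an initial value and compares it with the built-in. reduce_by_hand is a hypothetical helper written for this illustration, and the initial= keyword of itertools.accumulate assumes Python 3.8 or later.

```python
from functools import reduce
from itertools import accumulate

def reduce_by_hand(func, items, initial):
    """Hypothetical helper: what reduce(func, items, initial) does, step by step."""
    acc = initial
    for item in items:
        acc = func(acc, item)  # the accumulator is always the left argument
    return acc

sub = lambda x, y: x - y
values = [1, 2, 3, 5, 6]

by_hand = reduce_by_hand(sub, values, 0)
built_in = reduce(sub, values, 0)
# accumulate() yields every intermediate accumulator, starting with the initial value.
steps = list(accumulate(values, sub, initial=0))

print(by_hand, built_in, steps)  # -17 -17 [0, -1, -3, -6, -11, -17]
```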

from collections import defaultdict
dic = defaultdict(int)
dic[0],dic[2] = 1,3
print(defaultdict.__getitem__(dic, 5))  # prints 0: looking up the missing key 5 creates it with the default int() value

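# reduce can also be used to build a trie:
class Trie:
    def __init__(self, words: List[str]):
        # every missing key creates a nested defaultdict, so child nodes appear on first access
        add_child = lambda: defaultdict(add_child)
        self.trie = defaultdict(add_child)
        for word in words:
            # walk down one character at a time, then mark the end of the word with "$"
            reduce(dict.__getitem__, word, self.trie)["$"] = True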
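
A trie built with reduce() as above can be queried with a short walk over the characters. The sketch below is one possible way to do that; build_trie and contains are hypothetical names chosen for this example, and the words are made up. The lookup uses .get() instead of indexing, so that searching never creates new nodes in the defaultdict.

```python
from collections import defaultdict
from functools import reduce

def build_trie(words):
    """Build the nested-defaultdict trie; "$" marks the end of a word."""
    add_child = lambda: defaultdict(add_child)
    trie = defaultdict(add_child)
    for word in words:
        reduce(dict.__getitem__, word, trie)["$"] = True
    return trie

def contains(trie, word):
    """Return True if word was inserted; .get() avoids creating nodes during lookup."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return "$" in node

trie = build_trie(["tea", "ten"])
print(contains(trie, "tea"), contains(trie, "te"), contains(trie, "tex"))  # True False False
```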

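filter function

filter() keeps only the items for which the given function returns a true value. In Python 3 it also returns a lazy iterator. For example: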

b1 = filter(lambda x: 6 < x < 8, range(12))  # equivalent to [i for i in range(12) if 6 < i < 8]
print(list(b1))  # [7]
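
A related idiom, shown here as a small sketch with made-up data: passing None as the function makes filter() keep exactly the truthy items, which matches a comprehension with a bare if.

```python
values = [0, 1, "", "a", None, [], [2]]

truthy = list(filter(None, values))  # None as the function keeps only truthy items
same = [v for v in values if v]      # the equivalent list comprehension

print(truthy)  # [1, 'a', [2]]
```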

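sum function

sum() accepts a start value as its second argument. With an empty list as the start, it concatenates a list of lists into one flat list:

splits = [['a'], ['b'], ['c']]
print(sum(splits, []))  # ['a', 'b', 'c']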
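
Flattening with sum() builds a new list at every addition, so the cost grows quadratically with the total number of items. For longer inputs, itertools.chain.from_iterable is a common alternative. The sketch below only checks that both give the same result.

```python
from itertools import chain

splits = [['a'], ['b'], ['c']]

by_sum = sum(splits, [])                      # [] + ['a'] + ['b'] + ['c'], a new list at each step
by_chain = list(chain.from_iterable(splits))  # single pass, no intermediate lists

print(by_sum, by_chain)  # ['a', 'b', 'c'] ['a', 'b', 'c']
```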

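Longest-match Chinese word segmentation

A Trie can be used to implement Chinese word segmentation that prefers the longest dictionary match: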

from collections import defaultdict
from functools import reduce

class Tokenize:
    def __init__(self, words):
        add_child = lambda:defaultdict(add_child)
        self.trie = defaultdict(add_child)
        for word in words:
            reduce(defaultdict.__getitem__, word, self.trie)['$'] = True

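    def ch_tokenized(self, sentence):
        res = []
        sentence_len = len(sentence)
        # last_pos: start of the text not yet emitted (no dictionary hit since then)
        # start_pos: where the current round of matching begins
        start_pos, last_pos = 0, 0

        while start_pos < sentence_len:
            cur_node = self.trie
            off = start_pos
            match_end = start_pos  # end of the longest dictionary word found in this round
            # '$' is reserved as the end-of-word marker, so never follow it as a character
            while off < sentence_len and sentence[off] != '$' and sentence[off] in cur_node:
                cur_node = cur_node[sentence[off]]
                off += 1
                if '$' in cur_node:
                    match_end = off

            if match_end > start_pos:
                # flush the unmatched text before the word, then the word itself
                if start_pos > last_pos:
                    res.append(sentence[last_pos:start_pos])
                res.append(sentence[start_pos:match_end])
                start_pos = match_end
                last_pos = match_end
            else:
                start_pos += 1
        # flush any unmatched tail
        if start_pos > last_pos:
            res.append(sentence[last_pos:])
        return res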

tokenize = Tokenize(['张继科','渣男'])
print(tokenize.ch_tokenized('张继科是渣男啊'))  # ['张继科', '是', '渣男', '啊']
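
One detail of longest matching is worth isolating: when the dictionary contains words that are prefixes of other words, a walk through the trie may run past a complete word and then dead-end, and the matcher should fall back to the longest complete word seen on the way. The sketch below illustrates only that step. build and longest_match are hypothetical helper names, it uses plain dicts rather than the defaultdict trie, and the example words are made up.

```python
def build(words):
    """Plain-dict trie; '$' marks the end of a word (same marker convention as the article)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def longest_match(trie, text, start):
    """Return the end index of the longest dictionary word starting at start, or start if none."""
    node, end = trie, start
    for i in range(start, len(text)):
        node = node.get(text[i])
        if not isinstance(node, dict):  # no child for this character
            break
        if '$' in node:
            end = i + 1  # remember the longest complete word seen so far
    return end

trie = build(['北京', '北京大学', '大学'])
print(longest_match(trie, '北京大学生', 0))  # 4: '北京大学' wins over '北京'
print(longest_match(trie, '北京大街', 0))    # 2: the walk dead-ends after '北京大', so fall back to '北京'
print(longest_match(trie, '生', 0))          # 0: no dictionary word starts here
```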