Reading the Code of an English Word Segmenter

wxyfennie

Category: NLP

Article link: https://blog.csdn.net/wxyfennie/article/details/54178026

Reference link for the background knowledge: click here

Background knowledge: word segmentation, naive Bayes, Python (decorators, memoization), string suffixes

Class decorators

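Earlier we said that a decorator is a function that modifies another function, but decorators can also be used to modify classes or methods. Decorating a class is uncommon, but in some cases a class decorator is a useful alternative to a metaclass.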

foo = ['important', 'foo', 'stuff']

def add_foo(klass):
    klass.foo = foo
    return klass


@add_foo
class Person(object):
    pass

brian = Person()

# print() with a single argument behaves the same in Python 2 and Python 3
print(brian.foo)
# >> ['important', 'foo', 'stuff']

Memoization

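Memoization avoids potentially expensive repeated computation by caching the result of each call to a function. The next time the function is called with the same arguments, the result is returned from the cache instead of being computed again.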

from functools import wraps

def memoize(func):
    cache = {}

    @wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memoize
def an_expensive_function(arg1, arg2, arg3):
    ...
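
To make the effect of the cache visible, here is a minimal sketch that reuses the memoize decorator above. The function slow_square and the calls list are hypothetical names introduced only for this illustration; the list records each time the undecorated body really runs. Note that the cache is keyed on the argument tuple, so this approach assumes the arguments are hashable.

```python
from functools import wraps

def memoize(func):
    cache = {}

    @wraps(func)
    def wrapper(*args):
        # Compute only on a cache miss; args must be hashable
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

calls = []  # hypothetical helper: records each real execution of the body

@memoize
def slow_square(n):
    calls.append(n)
    return n * n

first = slow_square(4)    # cache miss: the body runs
second = slow_square(4)   # cache hit: the body does not run
third = slow_square(5)    # new argument: the body runs again

print(first, second, third)  # 16 16 25
print(calls)                 # [4, 5]
```

The second call with the argument 4 never reaches the function body, which is exactly the saving that memoization is meant to provide.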

Here is the code:

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]

def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)
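
The excerpt relies on memo, product and Pw, which are not shown. The following is a rough, self-contained sketch that fills them in so the segmenter can be run. Everything marked as hypothetical is an assumption made for this illustration: the TOY_PROBS table and the unknown-word penalty in Pw are invented toy values, not the probabilities of the original program, which would normally be estimated from corpus counts.

```python
from functools import reduce
import operator

def memo(f):
    "Cache the results of f, keyed by its argument tuple (assumed implementation)."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    return fmemo

def product(nums):
    "Multiply the numbers together (assumed implementation)."
    return reduce(operator.mul, nums, 1)

# Hypothetical toy unigram probabilities, for illustration only
TOY_PROBS = {'close': 0.01, 'the': 0.05, 'door': 0.01,
             'do': 0.01, 'or': 0.01, 'lose': 0.005, 'he': 0.02}

def Pw(word):
    "Toy word probability: unknown words get a tiny value that shrinks with length."
    return TOY_PROBS.get(word, 1e-6 / 10 ** len(word))

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first] + segment(rem) for first, rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]

def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

result = segment('closethedoor')
print(result)  # ['close', 'the', 'door']
```

With these toy values, any segmentation containing an unknown word is heavily penalised, so the three known words win.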

What each function does:

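@memo

Memoization: caches the result of each call to segment(text).

segment(text)

Segments text and returns the segmentation with the highest probability.

splits(text, L=20)

Returns every way of splitting text into two parts, first + rem, where first has at most L characters.

Pwords(words)

Returns the probability of the word sequence under the naive Bayes model, that is, the product of the individual word probabilities Pw(w).

product

Returns the product of several probability values.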
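
As a small illustration of splits, the sketch below calls it on the word close, once with the default L and once with L=2. The variable names pairs and short are introduced here only for the example.

```python
def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]

pairs = splits('close')
print(pairs)
# [('c', 'lose'), ('cl', 'ose'), ('clo', 'se'), ('clos', 'e'), ('close', '')]

# L caps the length of the first part, so only two splits remain
short = splits('close', L=2)
print(short)
# [('c', 'lose'), ('cl', 'ose')]
```

The last pair in the first result has an empty rem, which is the case that segment handles with its early return of an empty list.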

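Walking through the segmentation:

Take closethedoor as an example. The best segmentation is obviously close the door, but how does the code arrive at it?

Suppose the probabilities, from largest to smallest, are ordered like this: p(close) > p(c|lose) > p(c|l|ose) > p(c|lo|se) ...

The code caches the best segmentation of every suffix of the text (this is memoization over suffixes rather than a suffix array in the data-structure sense), which avoids a large amount of repeated computation.

For example, consider the call segment('close'):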

candidates = ([first]+segment(rem) for first,rem in splits(text))

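When this line runs, the first split is c | lose, so the code enters segment('lose'), which splits into l and ose and enters segment('ose'), and so on.

In this way the results for all five suffixes, close, lose, ose, se and e, are computed: each has its best segmentation and therefore its probability, p(close), p(lose), p(ose), p(se), p(e).

On later calls these results are read straight from the memo table. For example, the candidate that starts with cl reuses the cached result for ose; if that result is o | se, then p(cl|o|se) = p(cl)*p(o)*p(se).

So the body of segment runs in full only five times, once per non-empty suffix. Every other call for a suffix that has already been seen returns its cached result, which saves a great deal of repeated computation.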
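
The count of five can be checked with a small experiment. The sketch below is not the original program: make_segment is a hypothetical helper that builds a segmenter with or without a cache and records each non-empty text for which the body runs, and Pw is a stand-in scoring function chosen only so that the code executes (the actual segmentation it picks does not matter for counting).

```python
def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]

def Pw(word):
    # Stand-in probability (assumption): only needed so max() has a key
    return 10.0 ** -len(word)

def Pwords(words):
    p = 1.0
    for w in words:
        p *= Pw(w)
    return p

def make_segment(use_cache):
    "Hypothetical helper: returns (segment, seen), where seen logs body runs."
    seen = []    # non-empty texts for which the full body executed
    cache = {}
    def segment(text):
        if use_cache and text in cache:
            return cache[text]
        if not text:
            return []
        seen.append(text)
        candidates = ([first] + segment(rem) for first, rem in splits(text))
        best = max(candidates, key=Pwords)
        if use_cache:
            cache[text] = best
        return best
    return segment, seen

cached_segment, seen_cached = make_segment(use_cache=True)
cached_segment('close')
print(seen_cached)      # ['close', 'lose', 'ose', 'se', 'e']

plain_segment, seen_plain = make_segment(use_cache=False)
plain_segment('close')
print(len(seen_plain))  # 16
```

Without the cache the number of full executions doubles with each extra character (16 for a five-letter input), while with the cache it grows only with the number of suffixes.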

I felt I learned quite a lot from this and found it really interesting, so I wrote it down~
