[Deep Learning] Part1 Advanced Python Ch26 NLP Basics: DeepBlue Study Notes

LiongLoure

Categories: python, deep learning, machine learning. Tags: deep learning, python, natural language processing

Article link: https://blog.csdn.net/LiongLoure/article/details/125690910


This article is for study purposes only (an introductory OCR package; actual text recognition requires additional material).

Advanced Python: Ch26 NLP Basics

26. NLP Basics

Install: pip install jieba
Import: import jieba

import jieba
# Sample sentence mixing Chinese words, digits, and English slang
words_a2='在正义者联盟的电影里，嘻哈侠和蝙蝠侠联手打败了大boss，我高喊666，为他们疯狂打call'
# lcut returns the segmentation as a list (default precise mode)
result_l = jieba.lcut(words_a2)
print(result_l)
'''
['在', '正义者', '联盟', '的', '电影', '里', '，', '嘻哈侠', '和', '蝙蝠侠', '联手', '打败', '了', '大', 'boss', '，',
 '我', '高喊', '666', '，', '为', '他们', '疯狂', '打', 'call']
'''

26.1 Word Segmentation Algorithms

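Dictionary-based segmentation: forward/backward longest matching and bidirectional longest matching.
Pros: simple, fast, and reasonably accurate.
Cons: handles ambiguity and new words poorly; words missing from the dictionary (out-of-vocabulary words) cannot be recognized at all.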

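Statistics-based segmentation: machine learning and deep learning models such as HMM and LSTM+CRF.
Pros: handles out-of-vocabulary words and ambiguous words comparatively well.
Cons: needs a large, manually segmented corpus; the annotation effort is heavy, the models are more complex, and training is expensive.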
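
Statistical segmenters such as HMM or LSTM+CRF are commonly framed as per-character sequence labeling. The sketch below assumes the widely used B/M/E/S tag scheme (Begin, Middle, End, Single), which these notes do not spell out, and shows only the final decoding step that turns predicted tags into words. The function name `tags_to_words` and the hand-written tag string are hypothetical; no model is trained here.

```python
def tags_to_words(sentence, tags):
    """Convert a per-character B/M/E/S tag sequence into a list of words."""
    if len(sentence) != len(tags):
        raise ValueError("sentence and tags must have the same length")
    words, current = [], ""
    for char, tag in zip(sentence, tags):
        current += char
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that stops mid-word
        words.append(current)
    return words


# Tags written by hand for illustration; a trained model would predict them.
print(tags_to_words("我爱北京大学", "SSBMME"))  # ['我', '爱', '北京大学']
```

Because the model labels characters rather than looking words up, it can emit a word it has never seen as a whole, which is why this family copes better with out-of-vocabulary words.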

26.1.1 Dictionary-Based Segmentation Algorithms

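Forward/backward longest matching:
Starting from some index and extending the candidate word step by step, the algorithm prefers the longest dictionary word it can find; this rule is called longest matching. Scanning the sentence from front to back is forward longest matching; scanning from back to front is backward longest matching.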
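
To make the difference between the two scan directions concrete, here is a minimal, self-contained sketch. It assumes a small hand-made dictionary (`toy_dict`) and a hypothetical `max_len` window that caps the candidate word length; both are illustrative choices, not part of the notes.

```python
def forward_longest(sentence, dictionary, max_len=3):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(sentence):
        # Try the widest window first and shrink; a single character always matches.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_longest(sentence, dictionary, max_len=3):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, end = [], len(sentence)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = sentence[end - size:end]
            if size == 1 or piece in dictionary:
                words.append(piece)
                end -= size
                break
    words.reverse()  # words were collected from the end of the sentence
    return words


toy_dict = {"研究", "研究生", "生命", "起源"}  # hypothetical dictionary
print(forward_longest("研究生命起源", toy_dict))   # ['研究生', '命', '起源']
print(backward_longest("研究生命起源", toy_dict))  # ['研究', '生命', '起源']
```

The forward scan greedily takes 研究生 and strands 命 as a single character, while the backward scan recovers 研究 / 生命 / 起源.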
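Bidirectional longest matching:
A more elaborate rule set that combines the forward and backward algorithms. The procedure is:
(1) Run forward and backward longest matching; if the two results differ in word count, return the one with fewer words.
(2) Otherwise, return the one with fewer single-character words. If the single-character counts are also equal, return the backward longest matching result.
These rules are motivated by a linguistic observation: in Chinese, single-character words are far fewer than multi-character words. The algorithm should therefore keep single characters in the result to a minimum and preserve more complete words. An algorithm built on this kind of rule of thumb is called a heuristic algorithm.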

Fewer words: higher priority.
Fewer single-character words: higher priority.
When both are equal: backward matching takes priority.
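
The three priorities can be sketched as a small selection function that takes two finished segmentations. The name `choose_segmentation` and the sample lists are hypothetical and serve only to illustrate the order in which the rules apply.

```python
def choose_segmentation(forward, backward):
    """Pick between a forward and a backward segmentation using the three rules."""
    # Rule 1: the result with fewer words wins.
    if len(forward) != len(backward):
        return forward if len(forward) < len(backward) else backward
    # Rule 2: the result with fewer single-character words wins.
    single_f = sum(1 for w in forward if len(w) == 1)
    single_b = sum(1 for w in backward if len(w) == 1)
    # Rule 3: on a full tie, prefer the backward result.
    return forward if single_f < single_b else backward


print(choose_segmentation(["研究生", "命", "起源"], ["研究", "生命", "起源"]))
# ['研究', '生命', '起源']  (same word count, backward has fewer single characters)
```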

26.1.2 Code Implementation

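import jieba

# 1. Forward longest matching
# Note: dic_list is a module-level word list, defined in the __main__ block.
def get_forw_word(sentence):
    """
    :param sentence: the sentence to segment
    :return: list of words found by forward longest matching
    """
    for_res = []  # forward matching result
    len_sen = len(sentence)  # length of the sentence
    i = 0
    while i < len_sen:
        longest_word = sentence[i]  # fall back to a single character
        for j in range(i + 1, len_sen):
            text = sentence[i:j + 1]
            if text in dic_list:
                longest_word = text  # keep scanning, so the longest match wins
        for_res.append(longest_word)
        i = i + len(longest_word)
    return for_res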
# 2. Backward longest matching
def get_back_word(sentence):
    back_res = []  # backward matching result
    len_sen = len(sentence)
    i = len_sen - 1
    while i > -1:
        longest_word = sentence[i]  # fall back to a single character
        for j in range(i):
            text = sentence[j:i + 1]  # candidates shrink from longest to shortest
            if text in dic_list:
                longest_word = text
                break  # the first match is the longest one, so stop here
        back_res.insert(0, longest_word)  # prepend to keep sentence order
        i = i - len(longest_word)
    return back_res

# 3. Bidirectional longest matching
def get_twobila_word(sentence):
    for_res = get_forw_word(sentence)
    back_res = get_back_word(sentence)
    # Rule 1: the result with fewer words wins
    if len(for_res) != len(back_res):
        return for_res if len(for_res) < len(back_res) else back_res
    # Rule 2: the result with fewer single-character words wins
    a = len([word for word in for_res if len(word) == 1])
    b = len([word for word in back_res if len(word) == 1])
    # Rule 3: on a tie, prefer the backward result
    return for_res if a < b else back_res
if __name__ == '__main__':
    # Another sentence to try: "女施主自重，贫僧出家人家法号戒色"
    sent = '北京大学的学生前来应聘'  # sentence to segment
    # Build the reference dictionary from jieba's full mode (cut_all=True)
    dic_list = jieba.lcut(sent, cut_all=True)
    print('dict:', dic_list)
    print(get_forw_word(sent))
    print(get_back_word(sent))
    print(get_twobila_word(sent))

26.2 Trie (Dictionary Tree)

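A set of strings is commonly stored in a trie (also called a dictionary tree or prefix tree), a tree-shaped data structure over strings. Each edge of a trie corresponds to one character, and each path running down from the root spells out a string. A trie does not store strings on its nodes directly. Instead, a word is treated as the path from the root to some node, and that end node (shown in blue) carries a mark meaning "a word ends at this node". Since a string is just a path, looking up a word only requires walking down that path from the root. If the walk ends on a marked node, the string is in the set; otherwise it is not.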
[Figure: example trie; image not available]
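
The path-plus-end-mark idea can be sketched with nested Python dicts before moving to a class-based version. This is only an illustration: the sentinel key `END` and the helper names `trie_insert` and `trie_contains` are hypothetical.

```python
END = "__end__"  # hypothetical sentinel key marking "a word ends at this node"


def trie_insert(root, word):
    """Walk or create one node per character, then mark the final node."""
    node = root
    for char in word:
        node = node.setdefault(char, {})
    node[END] = True


def trie_contains(root, word):
    """Follow the path for word; it is a member only if the last node is marked."""
    node = root
    for char in word:
        node = node.get(char)
        if node is None:  # the path breaks off: no such string
            return False
    return END in node


root = {}
trie_insert(root, "自然")
trie_insert(root, "自然语言")
print(trie_contains(root, "自然"))      # True
print(trie_contains(root, "自然语"))    # False: the path exists but is not marked
print(trie_contains(root, "自然语言"))  # True
```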

26.2.1 A Typical Trie

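In the figure, blue marks a node where a word ends, and the numbers are arbitrary node labels.

The dictionary stored in this tree is listed in the table. Trace each path in the table with a pen and check whether you reach the corresponding word.
[Table: words stored in the trie and their paths; image not available]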

26.2.2 Implementing a Trie Node

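Each node needs to know at least its child nodes and the edges leading to them, and whether it corresponds to a word. To implement a map rather than a set, it also needs to know its own value. By convention, a value of None means the node does not correspond to a word. This rules out inserting a key whose value is None, but it keeps the implementation simpler. The node can then be written in Python as follows (see tests/book/ch02/trie.py):
[Image: node implementation code; not available]
In the _add_child method, we first check whether a child for the character char already exists, then use overwrite to decide whether to replace that child's value. This method is what links a child node to its parent.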

26.2.3 Code Implementation

# encoding:utf8

class Node:
    def __init__(self, value):
        self.children = {}  # maps a character to the child Node
        self.value = value  # end-of-word marker: a non-None value means a word ends here

    def add_child(self, childrenkey, childvalue, overwrite=False):
        # Check whether the child already exists
        child = self.children.get(childrenkey)
        if child is None:  # not present: create it
            child = Node(childvalue)
            self.children[childrenkey] = child
        elif overwrite:  # present: update its value
            child.value = childvalue
        return child

    def show_help(self, times=1):
        # Recursively print the subtree, indenting one level per depth
        for idx, items in enumerate(self.children.items()):
            key, value = items
            if times != 1:
                print('\n', end='')
            print('    '*times, end='')
            print("{'%s': " % key, end='')
            value.show_help(times + 1)
            print('\n', end='')
            print('    '*times, end='')
            print('}', end='')
            if idx < len(self.children) - 1:  # separator between siblings only
                print(', \n', end='')
            if times == 1:
                print('\n', end='')

    def show(self):
        print('{root:')
        self.show_help()
        print('}')


class Trie(Node):
    def __init__(self):
        super().__init__(None)

    def __setitem__(self, key, value):
        '''
        key: the string to add; each character of the string becomes one node
        value: the end-of-word marker, so it is assigned only to the last character
        '''
        father = self  # self is the root node
        for idx, char in enumerate(key):
            if idx == len(key) - 1:
                father = father.add_child(char, value, True)
            else:
                father = father.add_child(char, None, False)

    def __getitem__(self, key):
        father = self
        for childrenkey in key:
            father = father.children.get(childrenkey)
            if father is None:  # the path breaks here, so the key is absent
                return None
        return father.value

    def __contains__(self, key):
        # A key is present only if its last node carries a non-None value
        return self[key] is not None

    
if __name__ == '__main__':
    trie = Trie()

    # Insert (__setitem__)
    trie['自然语言'] = 0
    trie['自然'] = 1

    # Look up (__getitem__)
    value = trie['自然']  # 1
    value = trie['自然语言']  # 0
    value = trie['自然语']  # None: the path exists but no word ends there
    # Membership (__contains__)
    print('自然' in trie)  # True
    print('自然语' in trie)  # False
    print('自然语言' in trie)  # True
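
A trie pairs naturally with forward longest matching from section 26.1: one walk down the tree from a start position finds every dictionary word beginning there, and the deepest marked node gives the longest. The sketch below is self-contained and uses a nested-dict trie instead of the Node/Trie classes above; `END`, `build_trie`, `longest_prefix_word`, and `segment` are hypothetical names chosen for illustration.

```python
END = "__end__"  # hypothetical sentinel key marking the end of a word


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[END] = True
    return root


def longest_prefix_word(root, text, start):
    """Return the longest dictionary word starting at text[start], else one character."""
    node, best_end = root, start + 1
    for pos in range(start, len(text)):
        node = node.get(text[pos])
        if node is None:  # no dictionary word continues along this path
            break
        if END in node:  # a word ends here; remember the longest so far
            best_end = pos + 1
    return text[start:best_end]


def segment(text, root):
    """Forward longest matching driven by trie lookups."""
    words, i = [], 0
    while i < len(text):
        word = longest_prefix_word(root, text, i)
        words.append(word)
        i += len(word)
    return words


trie = build_trie(["自然", "自然语言", "语言", "处理"])
print(segment("自然语言处理", trie))  # ['自然语言', '处理']
```

Compared with testing every substring against a list, each start position costs a single walk whose length is bounded by the longest dictionary word.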