NLP in Practice 2: Implementing a Word Segmentation API

Lyttonkeepgoing

Category: NLP in Practice Notes. Tags: natural language processing, deep learning, artificial intelligence

Article link: https://blog.csdn.net/m0_53292725/article/details/121716151

1. Preparing the dictionary and stopwords

1.1 Preparing the dictionary

1.2 Preparing the stopwords

stopwords = set([i.strip() for i in open(config.stopwords_path).readlines()])

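# set(): a set is an unordered collection that does not allow duplicate elements, so its items cannot be accessed by index. It supports membership tests, removes duplicates automatically, and can compute intersections, differences, and unions.

# i.strip() for i in open(config.stopwords_path).readlines(): iterate over every line of the stopwords file and yield each line with .strip() applied

i.strip() removes whitespace from both ends of the string, including the trailing newline that readlines() leaves on each line.

Example: try it on a small made-up input to verify.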
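A minimal sketch of that check. The `lines` list is a hypothetical stand-in for what `readlines()` would return from a stopwords file; it is not taken from the original project.

```python
# Hypothetical lines, shaped like the output of readlines() on a stopwords file
lines = ["的\n", " 了 \n", "的\n", "是\n"]

# strip() drops surrounding whitespace and the newline; set() drops the duplicate "的"
stopwords = set(line.strip() for line in lines)

print(len(stopwords))        # number of distinct stopwords
print("了" in stopwords)     # membership test
print(stopwords & {"是", "不"})  # intersection with another set
```

A set is a good fit here because the later filtering step only asks whether a token is a stopword, and membership tests on a set do not slow down as the list grows.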

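# Differences between read(), readline(), and readlines()

read() # reads the entire file at once and returns it as a single string

readline() # reads one line per call (the first line on the first call, the next line on each later call) and returns it as a string

readlines() # reads all remaining lines and returns them as a list of strings; usually used together with for ... in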
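The three methods can be compared on a throwaway file. This is only a sketch: the file content is made up, and the temporary file stands in for a real stopwords file.

```python
import os
import tempfile

# Write a small temporary file with three hypothetical stopwords
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as f:
    f.write("的\n了\n是\n")
    path = f.name

with open(path, encoding="utf-8") as f:
    whole = f.read()            # one string holding the whole file

with open(path, encoding="utf-8") as f:
    first = f.readline()        # first call: first line
    second = f.readline()       # second call: next line

with open(path, encoding="utf-8") as f:
    all_lines = f.readlines()   # list with one string per line

os.remove(path)

print(repr(whole))
print(repr(first), repr(second))
print(all_lines)
```

Note that every line still ends with its newline character, which is why the stopwords code calls strip() on each one.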

2. A method for splitting a sentence into single characters

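import re
import string

# letters and filters are module-level names that the original post does not show.
# The two definitions below are assumptions for illustration only.
letters = string.ascii_lowercase  # a-z
filters = set(string.punctuation) | set("，。？！、；：")  # punctuation to drop


def _cut_by_word(sentence):
    # Split Chinese text into single characters; keep English words whole instead of splitting them into letters
    sentence = re.sub(r"\s+", " ", sentence)  # collapse each run of whitespace into one space
    sentence = sentence.strip()
    result = []
    temp = ""  # buffer for the English word currently being built
    for word in sentence:
        if word.lower() in letters:
            temp += word.lower()
        else:
            if temp != "":  # current character is not a letter: flush the buffered English word
                result.append(temp)
                temp = ""
            if word.strip() == "" or word in filters:  # whitespace or punctuation: skip it
                continue
            else:  # a single character
                result.append(word)
    if temp != "":  # the sentence ended while letters were still in the buffer
        result.append(temp)
    return result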

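# \s+ is a regular expression, a pattern written according to fixed rules and used to match strings

\s matches a single whitespace character, including but not limited to space, carriage return (\r), newline (\n), and tab (\t). For str patterns in Python 3 it also matches other Unicode whitespace by default; with the re.ASCII flag it matches only [ \t\n\r\f\v].

+ is a repetition quantifier: the expression immediately before it must match at least once, with no upper limit.

So \s+ means one or more consecutive whitespace characters. In the code, temp is a temporary buffer.

re.sub: "sub" is short for substitute, which makes it easy to remember.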
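As a small illustration of the two cleanup steps at the top of the function, the sketch below runs `re.sub` and `strip()` on a made-up string; the input is an assumption chosen to contain several kinds of whitespace.

```python
import re

raw = "  python \t和\n\n c++   哪个难？ "

# Replace every run of whitespace (spaces, tabs, newlines) with a single space
collapsed = re.sub(r"\s+", " ", raw)

# Then drop the single space left at each end
cleaned = collapsed.strip()

print(repr(collapsed))
print(repr(cleaned))
```

Writing the pattern as a raw string (`r"\s+"`) keeps Python from treating the backslash as a string escape before the regex engine sees it.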

3. Wrapping up the segmentation method

Make sure you understand every line of the segmentation process!

"""
Word segmentation
"""
import jieba.posseg as psg
import jieba
import config
import string
from lib.stopwords import stopwords
jieba.load_userdict(config.user_dict_path)
# Characters treated as part of an English token
letters = string.ascii_lowercase+"+"  # lowercase a-z plus "+", so that "c++" stays one token


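def cut_sentence_by_word(sentence):
    """
    Segment mixed Chinese and English text.
    Chinese is split into single characters; English is split into words.
    """
    # python和c++哪个难？ --> [python, 和, c++, 哪, 个, 难, ？]
    # If a character is English, accumulate it in a buffer; when the next non-English
    # character arrives, push the buffered English word into the result list.
    temp = ""
    result = []
    for word in sentence:
        # accumulate English letters into one word
        if word.lower() in letters:
            temp += word
        else:
            if temp != "":  # a non-empty temp holds the English letters collected so far
                result.append(temp.lower())
                temp = ""   # reset temp to an empty string
            if word.strip():  # skip whitespace so no empty tokens are added
                result.append(word)
    if temp != "":  # if the sentence ends with English, the buffer still has to be added
        result.append(temp.lower())
    return result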


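def cut(sentence, by_word=False, use_stopwords=False, with_sg=False):
    """
    :param sentence: str, the sentence to segment
    :param by_word: whether to split Chinese into single characters
    :param use_stopwords: whether to filter out stopwords
    :param with_sg: whether to return part-of-speech tags
    :return: list of tokens, or list of (word, flag) tuples when with_sg is set and by_word is not
    """
    if by_word:
        result = cut_sentence_by_word(sentence)
    else:  # segment into words rather than single characters
        result = psg.lcut(sentence)  # segmentation with part-of-speech tags
        result = [(i.word, i.flag) for i in result]
        if not with_sg:
            result = [i[0] for i in result]  # tags not requested: keep only the word, i[0]
    # optionally filter out stopwords
    if use_stopwords:
        # keep a token only if it is not a stopword; tagged items are (word, flag) tuples, so compare the word
        result = [i for i in result if (i[0] if isinstance(i, tuple) else i) not in stopwords]
    return result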


if __name__ == '__main__':
    print(string.ascii_lowercase)
    print(cut("python和c++哪个难？"))
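
The character-level branch can be cross-checked without jieba, config, or the stopwords file. The sketch below is not the method used above: it swaps the explicit loop for a single regular expression, and `cut_by_regex` is a hypothetical name introduced only for this comparison.

```python
import re


def cut_by_regex(sentence):
    """Hypothetical helper: a run of a-z or '+' is one token,
    every other non-whitespace character is a token on its own."""
    return re.findall(r"[a-z+]+|\S", sentence.lower())


tokens = cut_by_regex("Python和C++哪个难？")
print(tokens)
```

The alternation tries `[a-z+]+` first, so English runs such as "c++" are kept together, and `\S` then picks up each remaining Chinese character or punctuation mark. Whitespace matches neither branch, so it never produces a token.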