Deep Learning Day 18, Project 1: Text Segmentation

过动猿

Article link: https://blog.csdn.net/Dajian1040556534/article/details/120482048

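Text segmentation

Preparing the dictionary and stopwords

(1) Prepare the dictionary

# config.py: path to the custom jieba dictionary
user_dict_path = "C:/Users/dajian/PycharmProjects/pythonProject9/7.chat_service/corpus/user_dict/keywords.txt"
# Segmentation module (needs `import jieba` and `import config`): load the custom dictionary
jieba.load_userdict(config.user_dict_path)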

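(2) Prepare the stopwords

# config.py: path to the stopword list
stopwords_path = "C:/Users/dajian/PycharmProjects/pythonProject9/7.chat_service/corpus/user_dict/stopwords.txt"
# lib/stopwords.py (needs `import config`): read one stopword per line
stopwords = [i.strip() for i in open(config.stopwords_path,encoding="UTF-8").readlines()]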
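
The one-line loader above leaves the file handle open and keeps blank lines as empty entries. A minimal sketch of a tidier loader follows, assuming the file holds one stopword per line. The helper name `load_stopwords` and the throwaway demo file are illustrative and not part of the original project.

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stopword per line into a set, skipping blank lines.

    Hypothetical helper, not part of the original project.
    """
    with open(path, encoding="UTF-8") as f:
        return {line.strip() for line in f if line.strip()}


# Demo with a throwaway file standing in for stopwords.txt.
with tempfile.NamedTemporaryFile(
    "w", encoding="UTF-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("的\n啊\n\n?\n")
    demo_path = tmp.name

demo_stopwords = load_stopwords(demo_path)
os.remove(demo_path)
print(sorted(demo_stopwords))
```

A set makes each `in` check constant time on average, which helps when the stopword list is long; a plain list works the same way for small files.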

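A method for splitting a sentence character by character

def cut_sentence_by_word(sentence):
    '''
    Tokenize mixed Chinese and English text.
        Chinese: one token per character
        English: one token per word
    Relies on the module-level `letters` string (lowercase ASCII letters plus "+").
    '''
    # python和c++哪个难 -> [python,和,c++,哪,个,难]
    result = []
    temp = ""

    for word in sentence:
        if word.lower() in letters:  # a letter (or "+"): extend the current English word
            temp += word
        else:
            if temp != "":  # a non-letter ends the pending English word, so flush it
                result.append(temp.lower())
                temp = ""
            if word.strip():  # keep the non-letter character itself, skipping whitespace
                result.append(word)
    if temp != "":  # flush a trailing English word
        result.append(temp.lower())

    return result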
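
The same splitting rule can also be written with a regular expression. This is an alternative sketch, not the author's method: the name `cut_by_char_regex` is hypothetical, and unlike the loop above it lowercases the whole sentence before matching and always skips whitespace.

```python
import re

# A run of ASCII letters or "+" forms one token; any other non-space
# character is a token by itself. Whitespace is skipped.
TOKEN_PATTERN = re.compile(r"[a-z+]+|\S")


def cut_by_char_regex(sentence):
    """Hypothetical regex variant: English by word, everything else by character."""
    # Lowercasing the whole sentence first keeps the pattern simple.
    return TOKEN_PATTERN.findall(sentence.lower())


print(cut_by_char_regex("python和c++哪个难"))
# -> ['python', '和', 'c++', '哪', '个', '难']
print(cut_by_char_regex("Hello World"))
# -> ['hello', 'world']
```

Digits are not in the letter class, so an input such as "python3" yields "python" and "3" as separate tokens, which matches what the character loop does.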

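Wrapping the segmentation methods in one module

import jieba
import jieba.posseg as psg
import config
import string
from lib.stopwords import stopwords

# Load the custom dictionary into jieba
jieba.load_userdict(config.user_dict_path)

# Characters treated as part of an English word: lowercase ASCII letters plus "+" (for "c++")
letters = string.ascii_lowercase+"+"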

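def cut_sentence_by_word(sentence):
    '''
    Tokenize mixed Chinese and English text.
        Chinese: one token per character
        English: one token per word
    Relies on the module-level `letters` string (lowercase ASCII letters plus "+").
    '''
    # python和c++哪个难 -> [python,和,c++,哪,个,难]
    result = []
    temp = ""

    for word in sentence:
        if word.lower() in letters:  # a letter (or "+"): extend the current English word
            temp += word
        else:
            if temp != "":  # a non-letter ends the pending English word, so flush it
                result.append(temp.lower())
                temp = ""
            if word.strip():  # keep the non-letter character itself, skipping whitespace
                result.append(word)
    if temp != "":  # flush a trailing English word
        result.append(temp.lower())

    return result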


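def cut(sentence,by_word=False,use_stopwords=False,with_sg=False):
    '''
    :param sentence: the sentence to segment
    :param by_word: split character by character if True (default False); with_sg is ignored in this mode
    :param use_stopwords: remove stopwords if True (default False)
    :param with_sg: return part-of-speech tags if True (default False)
    :return: a list of tokens, or a list of (word, flag) tuples when with_sg is True and by_word is False
    '''
    if by_word:  # character-level segmentation
        result = cut_sentence_by_word(sentence)
    else:
        if with_sg:
            result = psg.lcut(sentence)  # jieba.posseg also yields the part of speech -> pair(word, flag)
            result = [(i.word,i.flag) for i in result]
        else:
            result = jieba.lcut(sentence)
    # Optionally remove stopwords
    if use_stopwords:
        if with_sg and not by_word:
            # entries are (word, flag) tuples, so compare only the word against the stopword list
            result = [i for i in result if i[0] not in stopwords]
        else:
            result = [i for i in result if i not in stopwords]
    return result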

# if __name__ == '__main__':
#     a = "python难不难啊?是不是很难"
#     print(cut(a,use_stopwords=True))
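
When part-of-speech tagging is on, each result is a (word, flag) tuple, and a tuple never equals a stopword string, so a membership test on the whole tuple filters nothing. The sketch below illustrates this with plain Python and no jieba dependency. The helper `filter_stopwords` is hypothetical, and the token lists are hand-written stand-ins with placeholder tags, not real jieba output.

```python
def filter_stopwords(tokens, stopwords):
    """Drop stopwords from plain tokens or from (word, flag) pairs.

    Hypothetical helper for illustration only.
    """
    kept = []
    for tok in tokens:
        # For a (word, flag) pair, only the word part is compared.
        word = tok[0] if isinstance(tok, tuple) else tok
        if word not in stopwords:
            kept.append(tok)
    return kept


# Hand-written sample data; the flags are placeholders, not jieba output.
stopwords = {"啊", "?"}
plain = ["python", "难", "不", "难", "啊", "?"]
tagged = [("python", "eng"), ("难", "a"), ("啊", "y"), ("?", "x")]

# Comparing whole tuples against the stopword strings removes nothing.
naive = [t for t in tagged if t not in stopwords]

print(filter_stopwords(plain, stopwords))
print(filter_stopwords(tagged, stopwords))
print(naive == tagged)
```

Checking the word part keeps the tags attached to the surviving tokens, so downstream code that expects pairs still receives pairs.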
