Removing Stop Words and Segmenting Mixed Chinese-English Text with jieba

小李爱发呆

Article link: https://blog.csdn.net/weixin_45899520/article/details/108965577

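Segmenting text that mixes Chinese and English, and removing stop words
The stop word list used here can be one you define yourself or one downloaded from the web.
This is a preprocessing step for text classification and sentiment analysis.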
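To make the point about combining a downloaded stop word list with your own entries concrete, here is a minimal sketch. The helper names `build_stopword_set` and `remove_stopwords`, the sample words, and the choice to lowercase English tokens before comparing are all assumptions for illustration and are not part of the original post.

```python
def build_stopword_set(lines, extra_words=()):
    """Merge stop words read from a file-like iterable with custom extras."""
    words = {line.strip().lower() for line in lines if line.strip()}
    words.update(w.strip().lower() for w in extra_words if w.strip())
    return words


def remove_stopwords(tokens, stopwords):
    """Keep tokens that are neither whitespace nor stop words.

    English tokens are compared in lowercase so "The" matches "the";
    lower() leaves Chinese characters unchanged.
    """
    return [t for t in tokens if t.strip() and t.strip().lower() not in stopwords]


# Hypothetical sample data standing in for a stop word file and jieba output.
downloaded = ["的\n", "了\n", "the\n", "\n"]
custom = ["这个", "is"]
stopwords = build_stopword_set(downloaded, custom)

tokens = ["这个", "手机", "的", "屏幕", " ", "is", " ", "The", " ", "best"]
print(remove_stopwords(tokens, stopwords))  # ['手机', '屏幕', 'best']
```

In a real run, `lines` would typically be an open stop word file and `tokens` the output of `jieba.cut(sentence)`. Using a set keeps each membership test fast even when the list holds thousands of entries.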

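from collections import Counter
import jieba

# jieba.load_userdict('userdict.txt')  # optional: load a custom user dictionary

# Build the stop word collection (a set, for fast membership tests)
def stopwordslist(filepath):
    # Assumes the stop word file is UTF-8 encoded; change the encoding if yours differs
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

# Path to the stop word file, loaded once; custom stop words can also be added to this set
stopwords = stopwordslist('E:\\pythonimg\\stopword.txt')

# Segment one sentence and drop stop words
def seg_sentence(sentence):
    sentence_seged = jieba.cut(sentence.strip())  # generator of tokens from jieba
    outstr = ''
    for word in sentence_seged:
        # Skip stop words and whitespace tokens (spaces between English words, tabs)
        if word not in stopwords and word.strip():
            outstr += word
            outstr += " "
    return outstr


# Input file to process, and output file for the segmented text
with open('E:\\pythonimg\\comment.txt', 'r', encoding='utf-8') as inputs, \
        open('E:\\pythonimg\\已去除停用词.txt', 'w', encoding='utf-8') as outputs:
    for line in inputs:
        line_seg = seg_sentence(line)  # the return value is a string
        outputs.write(line_seg + '\n')  # keep one output line per input line

# Word count
with open('E:\\pythonimg\\已去除停用词.txt', 'r', encoding='utf-8') as fr:  # read back the file with stop words removed
    data = fr.read().split()  # tokens are already separated by spaces
data = dict(Counter(data))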
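Once the word counts exist, a common next step is to look at the most frequent words. The following is a small sketch of that idea using `Counter.most_common`; the function name `top_words` and the sample string are hypothetical and only stand in for the contents of the segmented output file.

```python
from collections import Counter


def top_words(segmented_text, n=2):
    """Count space-separated tokens and return the n most frequent."""
    counts = Counter(segmented_text.split())
    return counts.most_common(n)


# Hypothetical segmented text, in the same space-separated form as the output file.
sample = "手机 屏幕 好 手机 good 手机 屏幕"
print(top_words(sample))  # [('手机', 3), ('屏幕', 2)]
```

Splitting on whitespace is enough here because the text has already been segmented, so running jieba a second time is not needed.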

Test example
[Two screenshots from the original post; images not available]
