(9) Fixing stop words that cannot be removed after jieba segmentation

Latest recommended article published on 2024-01-03 14:02:26

看我七十三变

Category: Python learning

Article link: https://blog.csdn.net/HaiYang_Gao/article/details/89527137

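This post examines why stop words cannot be removed after jieba segmentation. The cause is an encoding problem. Debugging shows that the segmentation results are unicode objects while the stop word list holds str objects, so the two never compare as equal and nothing is removed. The fix is to encode each element of cut so that its type matches the elements of stopwords. This removes the stop words successfully and also avoids the error raised when writing the result to a file.

Summary generated by CSDN using AI technology.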

Cause of the problem: an encoding mismatch (unicode tokens compared against str stop words)
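
The mismatch can be reproduced without jieba. The following is a minimal sketch, not the author's code: it uses Python 3's str and bytes types as an analogy for the unicode and str types described above, and the token and stop word values are made up for illustration.

```python
# Text token, standing in for one item returned by the tokenizer (assumed value).
token = '我'

# Stop words held as raw UTF-8 bytes, standing in for entries read from a file
# without decoding (assumed values).
stopwords_bytes = {'我'.encode('utf-8'), '是'.encode('utf-8')}

# Text and bytes never compare as equal, so the membership test fails
# even though the characters are the same.
found_before = token in stopwords_bytes

# Once both sides have the same type, the lookup succeeds.
found_after = token.encode('utf-8') in stopwords_bytes

print(found_before, found_after)
```

Either side can be converted. What matters is that the tokens and the stop words share one type before they are compared.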

1. Test code

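# -*- coding: utf-8 -*-
# Python 2 code. file2file is the author's own helper class; its definition is not shown here.
import chardet
import jieba

if __name__ == '__main__':
    f = file2file()

    s = '中国是个好地方，我住在这里。'
    # Stop words read from the file are str (byte string) objects
    stopwords = set(sum(f.readtxt('../data/HITstopwords.txt'), []))
    # Check the character set of s
    s_charset = chardet.detect(s)
    # Segment with jieba; each element of cut is a unicode object
    cut = jieba.lcut(s)
    # The next line always raises an error, because cut[0] is unicode, not a byte string
    # cut_charset = chardet.detect(cut[0])

    # # Encoding step: Begin
    # k = []
    # for each in cut:
    #     k.append(each.encode('utf-8'))
    # # Encoding step: End

    # Remove stop words; to apply the fix, enable the encoding step and use k instead of cut
    cut__stop_data = [word for word in cut if word not in stopwords]
    # cut__stop_data = [word for word in k if word not in stopwords]

    # Write to a local file (with unicode tokens, this write is where the error appears)
    with open('test.txt', 'w') as out:
        out.write(' '.join(cut__stop_data))
    print('------------------Run over-----------------')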
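
The same idea can be packaged as a small helper that normalizes both sides to text before filtering. This is a hedged sketch written for Python 3, not the author's implementation: `to_text` and `remove_stopwords` are hypothetical names, the token list stands in for the output of `jieba.lcut`, and the stop word list stands in for the contents of the stop word file.

```python
import io
import os
import tempfile


def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is raw bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def remove_stopwords(tokens, stopwords):
    """Filter tokens after converting tokens and stop words to the same type."""
    stop_set = {to_text(word) for word in stopwords}
    return [tok for tok in map(to_text, tokens) if tok not in stop_set]


# Hypothetical tokens, standing in for the result of jieba.lcut(s).
tokens = ['中国', '是', '个', '好', '地方', '，', '我', '住', '在', '这里', '。']
# Hypothetical stop words held as UTF-8 bytes, as if read from a file undecoded.
stopwords = [w.encode('utf-8') for w in ['是', '个', '，', '我', '在', '。']]

filtered = remove_stopwords(tokens, stopwords)

# Write with an explicit encoding so the output does not depend on the
# platform default.
out_path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with io.open(out_path, 'w', encoding='utf-8') as fh:
    fh.write(' '.join(filtered))

print(filtered)
```

Decoding everything to text at the boundary and encoding only when writing keeps the comparison logic free of encoding concerns. The original fix goes the other way and encodes the tokens to byte strings, which works equally well as long as the stop word file is UTF-8.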