Data Preprocessing, Part 1: Code for Text Segmentation and Stop-Word Removal

Author: 小懒快要丑哭啦

Tags: text classification, stop-word removal, data preprocessing

Article link: https://blog.csdn.net/mr_pgz/article/details/100880622

Data layout: Sogou dataset (top-level folder) $\rightarrow$ category (second-level folder, e.g. military) $\rightarrow$ 10.txt (one text file under the military category)
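To make the folder layout concrete, here is a minimal sketch of how the files could be enumerated. The helper name `iter_corpus_files` and the `corpus_root` parameter are assumptions for illustration, not part of the original post, and the sketch assumes every document is a `.txt` file exactly one level below its category folder.

```python
import os

def iter_corpus_files(corpus_root):
    """Yield (category, file_path) for each .txt file inside each category folder.

    Hypothetical helper: assumes the layout corpus_root/<category>/<name>.txt.
    """
    for category in sorted(os.listdir(corpus_root)):
        category_dir = os.path.join(corpus_root, category)
        if not os.path.isdir(category_dir):
            continue  # skip stray files sitting next to the category folders
        for name in sorted(os.listdir(category_dir)):
            if name.endswith(".txt"):
                yield category, os.path.join(category_dir, name)
```

Yielding the category together with the path keeps the class label attached to each document, which is what a later text-classification step would need.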

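import os
import jieba

# The original used encoding='ANSI', which Python only accepts on Windows
# (as an alias for the system code page). 'gb18030' is an assumption here:
# it is a superset of GBK, the usual code page on Chinese-locale Windows.
ENCODING = 'gb18030'

# Save content to a file
def savefile(savepath, content):
    with open(savepath, 'w', encoding=ENCODING, errors='ignore') as fp:
        fp.write(content)

# Read a file and return its content
def readfile(path):
    with open(path, 'r', encoding=ENCODING, errors='ignore') as fp:
        return fp.read()

## Two functions for removing stop words
# Build the stop-word list (one word per line in the file)
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='utf-8') as fp:
        stopwords = [line.strip() for line in fp]
    return stopwords

# Remove stop words from a sentence
def movestopwords(sentence):
    ...  # the rest of this function is cut off in the source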
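The body of `movestopwords` is not available in the source, so the following is only a rough sketch of how a stop-word filter of this kind could look, not the author's implementation. The names `load_stopwords` and `remove_stopwords` are hypothetical, and the stop words and tokens are made-up sample data. The sketch uses a set rather than a list, so each membership check takes constant time on average.

```python
def load_stopwords(lines):
    """Build a set of stop words from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}

def remove_stopwords(tokens, stopwords):
    """Keep tokens that are neither stop words nor pure whitespace, joined by spaces."""
    kept = [tok for tok in tokens if tok.strip() and tok not in stopwords]
    return " ".join(kept)

# In the real pipeline the tokens would come from jieba, e.g. jieba.cut(sentence).
# A hand-segmented list is used here so the sketch runs without jieba or the corpus.
stopwords = load_stopwords(["的", "了", "", "是\n"])
tokens = ["今天", "的", "天气", "是", "晴天", "\n"]
result = remove_stopwords(tokens, stopwords)
# result == "今天 天气 晴天"
```

Joining the surviving tokens with spaces gives one whitespace-delimited string per document, a common input format for bag-of-words vectorizers.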
