TF-IDF Feature Extraction After Text Segmentation

小懒快要丑哭啦

Tags: TF-IDF, feature extraction

Article link: https://blog.csdn.net/Mr_PGZ/article/details/100881107

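This post shows how to use the TF-IDF algorithm to extract key features from text after preprocessing (segmentation and stop-word removal). The technique is widely used in information retrieval and natural language processing.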

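As a rough illustration of what TF-IDF computes on already-segmented text, here is a minimal standard-library sketch. It is not the author's implementation: the function name `tfidf`, the sample documents, and the choice of the plain `tf * log(N / df)` weighting are all assumptions made for this example. Library implementations typically add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    docs: list of token lists (already segmented, stop words removed).
    Returns one {term: weight} dict per document.
    Assumed weighting: tf = count / doc_length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already-segmented sample documents
docs = [
    ["机器", "学习", "文本"],   # "machine", "learning", "text"
    ["文本", "分类", "文本"],   # "text", "classification", "text"
]
result = tfidf(docs)
```

With this weighting, a term that appears in every document (here "文本") gets weight 0, while terms unique to one document get the highest weights. That is the property that makes TF-IDF useful for picking out distinguishing features.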

import os
import jieba

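# Save content to a file.
# Note: 'ANSI' is a Windows-only codec alias (for 'mbcs'); on other
# platforms an explicit codec name is needed instead.
def savefile(savepath, content):
    with open(savepath, 'w', encoding='ANSI', errors='ignore') as fp:
        fp.write(content)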

# Read a file and return its content as a string.
# Undecodable bytes are silently dropped (errors='ignore').
def readfile(path):
    with open(path, 'r', encoding='ANSI', errors='ignore') as fp:
        return fp.read()

# Build the stop-word list (one stop word per line in the file)
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='utf-8') as fp:
        return [line.strip() for line in fp]

# Remove stop words from a segmented sentence (an iterable of words)
def movestopwords(sentence):
    # Path to the stop-word file; a set makes membership tests fast
    stopwords = set(stopwordslist('E:/stop_words.txt'))
    return [word for word in sentence if word not in stopwords]

if __name__ == '__main__':

    corpus_path = "E:/sogouDataSet_trai