sklearn read csv: Extracting keywords from short texts with sklearn's TF-IDF module

Latest recommended article published on 2024-08-29 23:40:03

weixin_39596975

Article tags: sklearn read csv, sklearn import csv file

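This article shows how to use sklearn's TF-IDF module to extract keywords from short Sina sports news texts. It builds a text-reading function, a text denoising step (removing stop words, digits, and long words), and a main function, then shows the extraction results. The extracted keywords summarize the text content fairly well.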

This is an attempt to use sklearn's TF-IDF module to extract keywords from a sample of short sports-category articles from Sina News.

1. Build the text-reading function

def
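A minimal sketch of what such a reading function might look like, assuming the news texts live in a CSV file with one text column and one label column. The function name `read_text_data` and the column names `content` and `label` are hypothetical and are not taken from the original post; only `pandas.read_csv` is a real library call.

```python
import pandas as pd


def read_text_data(csv_path, text_col="content", label_col="label"):
    """Read a CSV of short news texts and return (texts, labels) as lists.

    The function name and the default column names are assumptions made
    for illustration; adjust them to match the actual file.
    """
    df = pd.read_csv(csv_path, encoding="utf-8")
    # Drop rows with a missing text or label so later steps only see real values.
    df = df.dropna(subset=[text_col, label_col])
    texts = df[text_col].astype(str).tolist()
    labels = df[label_col].tolist()
    return texts, labels
```

Returning plain lists keeps the later steps simple: each text can be passed to the preprocessing function one at a time.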

2. Denoise the text: remove stop words and purely numeric strings, and keep only words whose length is greater than 1 and less than 5

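def text_preprossing(context):
    # stop_words is expected to be a collection of stop words defined at module level.
    cus = []
    # Segment the text with jieba in precise mode (cut_all=False).
    words_cut = jieba.cut(context, cut_all=False)
    for item in words_cut:
        # Keep tokens that are not stop words, not purely digits, and 2 to 4 characters long.
        if item not in stop_words and not item.isdigit() and 1 < len(item) < 5:
            cus.append(item)
    # Join the kept tokens with spaces so TfidfVectorizer can split them again.
    return ' '.join(cus)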
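The function above relies on a `stop_words` collection that the post does not show being built. One possible way to prepare it, assuming a plain-text stop-word file with one word per line, is sketched below. The names `load_stop_words` and `keep_token` are hypothetical helpers, and the file format is an assumption; `keep_token` simply restates the filter condition from `text_preprossing` so it can be checked in isolation.

```python
def load_stop_words(path, encoding="utf-8"):
    """Load a stop-word list (assumed: one word per line) into a set."""
    with open(path, encoding=encoding) as f:
        # Strip surrounding whitespace and skip blank lines.
        # A set makes the `item not in stop_words` check fast.
        return {line.strip() for line in f if line.strip()}


def keep_token(token, stop_words):
    """Same condition as in text_preprossing: not a stop word,
    not purely digits, and between 2 and 4 characters long."""
    return token not in stop_words and not token.isdigit() and 1 < len(token) < 5
```

With these helpers, `stop_words = load_stop_words("stopwords.txt")` at module level would make the filter usable; the file name here is only a placeholder.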

3. Main function

import jieba
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
if __name__=='__main__':
    start = False
    text_data,label = r