An attempt to extract keywords from a selection of short sports articles from Sina News, using the TF-IDF module in sklearn.
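As background, here is a minimal sketch of how a TF-IDF keyword score is formed, written in plain Python rather than with sklearn. It assumes the smoothed idf `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to my understanding matches the default settings of `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`). The function name `tfidf_keywords`, the `top_k` parameter, and the sample documents are hypothetical and are not taken from the article's data.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Rank the tokens of each document by a hand-computed TF-IDF score.

    docs: list of strings whose tokens are separated by spaces.
    Assumes smoothed idf ln((1 + n) / (1 + df)) + 1 and L2 normalization.
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each token appears in.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}

    results = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts within this document
        weights = {tok: count * idf[tok] for tok, count in tf.items()}
        # L2 norm of the document's weight vector (guard against empty docs).
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        # Highest weight first; ties are broken by the token string.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append([(tok, w / norm) for tok, w in ranked[:top_k]])
    return results


# Hypothetical, already segmented sample documents (not the article's data).
sample_docs = [
    "篮球 比赛 球队 篮球",
    "足球 比赛 球队",
    "网球 比赛 选手",
]
keywords = tfidf_keywords(sample_docs, top_k=2)
for doc_keywords in keywords:
    print(doc_keywords)
```

A word such as 比赛 that occurs in every document gets the lowest idf, so it drops out of the top keywords, while a word repeated inside a single document (篮球 in the first one) rises to the top.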
1. Build the text-reading function
def
2. Text denoising: remove stop words, remove purely numeric tokens, and keep only words whose length is greater than 1 and less than 5
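def text_preprossing(context):
    # Assumes `jieba` is imported and `stop_words` is defined at module level.
    cus = []
    # Segment the text in precise mode (cut_all=False).
    words_cut = jieba.cut(context, cut_all=False)
    for item in words_cut:
        # Keep non-stop-word, non-numeric tokens that are 2 to 4 characters long.
        if item not in stop_words and not item.isdigit() and 1 < len(item) < 5:
            cus.append(item)
    print(cus)
    # Join the kept tokens with spaces, ready for whitespace-based tokenization.
    return ' '.join(cus)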
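The filtering rule can be checked on its own, without jieba, by applying it to a list that is already tokenized. The sketch below is only an illustration: `filter_tokens`, the sample tokens, and the sample stop words are hypothetical stand-ins for the jieba output and for the stop-word list, which the article does not show.

```python
def filter_tokens(tokens, stop_words):
    """Keep tokens that are not stop words, not pure digits, and 2 to 4 characters long."""
    kept = []
    for tok in tokens:
        if tok not in stop_words and not tok.isdigit() and 1 < len(tok) < 5:
            kept.append(tok)
    return " ".join(kept)


# Hypothetical tokens and stop words, standing in for jieba output and the real list.
sample_tokens = ["中国", "队", "2019", "的", "世界杯", "我们", "中华人民共和国"]
sample_stop_words = {"的", "我们"}
filtered = filter_tokens(sample_tokens, sample_stop_words)
print(filtered)  # 中国 世界杯
```

Here 队 is dropped for being a single character, 2019 for being numeric, 的 and 我们 as stop words, and 中华人民共和国 for being longer than four characters.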
3. Main function
import jieba
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
if __name__ == '__main__':
    start = False
    text_data,label = r