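Preparing the data
We use R8, a subset of the Reuters news dataset that contains 8 categories of news.
This article reads the already-cleaned R8 file directly. Cleaning removed special characters, punctuation, stop words, and low-frequency words. Because the text is English, no word segmentation step is needed.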
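For readers who want to see what such a cleaning pass might look like, here is a minimal sketch. It is not the script that produced R8.clean.txt: the function name `clean_corpus`, the tiny stop-word set, and the frequency threshold are all hypothetical, and dropping digits is an assumption of this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list and threshold, chosen only for illustration.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "for"}
MIN_FREQ = 2

def clean_corpus(raw_docs, stop_words=STOP_WORDS, min_freq=MIN_FREQ):
    """Lowercase, strip non-letters, and drop stop words and rare words."""
    # Replace anything that is not a lowercase letter or whitespace with a space,
    # then split on whitespace (assumption: digits are dropped as well).
    tokenized = [re.sub(r"[^a-z\s]", " ", doc.lower()).split() for doc in raw_docs]
    # Count word frequencies across the whole corpus.
    freq = Counter(word for doc in tokenized for word in doc)
    cleaned = []
    for doc in tokenized:
        kept = [w for w in doc if w not in stop_words and freq[w] >= min_freq]
        cleaned.append(" ".join(kept))
    return cleaned

raw = ["The board approved a stock split!", "Stock split approved by the board."]
print(clean_corpus(raw))
# ['board approved stock split', 'stock split approved board']
```

In practice the stop-word list would come from a standard resource and the threshold would be tuned to the corpus. The values here only keep the example small.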
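# Read the cleaned R8 corpus, one document per line.
doc_list = []
with open('R8.clean.txt', 'r', encoding='utf-8') as f:
    for line in f:
        # Strip the trailing newline and surrounding whitespace.
        doc_list.append(line.strip())
print(doc_list[0])  # show the first document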
champion products approves stock split champion products inc said board directors approved two one stock split common shares shareholders record april company also said board voted recommend shareholders annual meeting april increase authorized c
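Since the cleaned text needs no word segmentation, a plain whitespace split is enough to turn a document into tokens. The sketch below uses only the first few words of the output above as a stand-in for `doc_list[0]`, and the variable names are illustrative.

```python
from collections import Counter

# Stand-in for doc_list[0]: the first ten words of the printed document.
doc = "champion products approves stock split champion products inc said board"

# Whitespace splitting is sufficient for the cleaned English text.
tokens = doc.split()
print(tokens[:5])

# Count how often each word occurs in this document.
counts = Counter(tokens)
print(counts["champion"])
```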