一、Overall Workflow
1. Read in the text and segment it into words
2. Remove stopwords from the segmented text
3. Compute term weights with TF-IDF
4. Cluster the documents with K-means
(Since my skill level is limited, I can only write this in the way I find easiest to understand, so it may look cumbersome; please bear with me.) A compact end-to-end sketch of all four steps is shown below for reference.
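The sketch below is a minimal illustration only, assuming scikit-learn's TfidfVectorizer and KMeans; the placeholder documents and the n_clusters value are illustrative assumptions, not the article's actual data, and the sections below walk through each step in more detail.

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["文档一的内容", "文档二的内容", "文档三的内容", "文档四的内容"]  # placeholder texts

# Steps 1-2: segment with jieba (stopword removal omitted for brevity)
segmented = [" ".join(jieba.cut(d)) for d in docs]

# Step 3: TF-IDF weight matrix
tfidf = TfidfVectorizer().fit_transform(segmented)

# Step 4: K-means clustering (n_clusters=2 is an arbitrary illustrative choice)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(tfidf)
print(labels)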
二、Reading the Text and Segmenting It
1. Reading the text
(1) The texts come from the Sogou news corpus (link: )
(2) Read in the text (code below):
def read_from_file(file_name):
    # The Sogou news corpus is commonly GBK-encoded; this encoding argument
    # is an assumption, so adjust it if your copy of the corpus differs.
    with open(file_name, encoding="gbk") as fp:
        words = fp.read()
    return words

words = read_from_file("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\sougou_all\\互联网\\1.txt")
words1 = read_from_file("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\sougou_all\\互联网\\2.txt")
words2 = read_from_file("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\sougou_all\\互联网\\3.txt")
words3 = read_from_file("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\sougou_all\\互联网\\4.txt")
listall = [words, words1, words2, words3]
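As a side note, the four explicit calls above could also be collapsed into one loop over the folder. This is only an optional sketch using the standard library's os module, reusing the same directory path as above.

import os

folder = "D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\sougou_all\\互联网"
# Read every .txt file in the folder in sorted name order
# (equivalent to the explicit calls above for files 1.txt through 4.txt)
listall = [read_from_file(os.path.join(folder, name))
           for name in sorted(os.listdir(folder))
           if name.endswith(".txt")]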
2. Word segmentation
(1) Install the jieba library: segmentation requires jieba. In PyCharm, open Settings → Project Interpreter, click the plus sign in the upper right, search for jieba, and click Install (or simply run pip install jieba from a terminal).
(2) Segment the text (code below):
import jieba

def cut_words(words):
    # jieba.cut returns a generator of tokens; collect them into a list
    result = jieba.cut(words)
    tokens = []
    for r in result:
        tokens.append(r)
    return tokens
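As a quick usage check, the function can be applied to each document read earlier; listall_cut here is just a hypothetical name for the resulting list of token lists.

# Segment every document in listall; each entry becomes a list of tokens
listall_cut = [cut_words(doc) for doc in listall]
print(listall_cut[0][:10])  # first ten tokens of the first document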