Python reading docs: building a co-occurrence matrix in Python + Pajek visualization

Latest recommended article published 2021-12-08 23:54:05

weixin_39519769

Article tags: python, reading doc

Article link: https://blog.csdn.net/weixin_39519769/article/details/111635725

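This post shows how to process documents with Python, build a co-occurrence matrix, and visualize it with Pajek. The text is first segmented into Chinese words and stripped of stop words, and the result is stored. UCINET then converts the matrix, a threshold is applied, and the network is saved in .net format. Finally, the graph is viewed in NetDraw or Pajek, and a .vec file can be added to refine it.

Summary generated automatically by CSDN.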

This code grew out of a real need~

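Step 1: build the co-occurrence matrix in Python

Chinese word segmentation and stop-word removal
Data storage

Step 2: visualize with Pajek together with UCINET; VOSviewer also works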

(Figure: the resulting network graph.)

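Building the co-occurrence matrix in Python

For the code, see:

https://www.cnblogs.com/Cookie-Jing/p/13837525.html (or click the original-article link)

The data preprocessing is much the same as in the earlier topic-model (LDA) post.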

import numpy as np
import pandas as pd
from pprint import pprint
import xlrd  # read Excel data (xlrd 2.0+ reads only .xls; .xlsx needs xlrd<2.0)
import re
import jieba  # Chinese word segmentation with jieba

path = r"D:\01研\01大四\2020.3.13-国家突发卫生事件\20201008\lda.xlsx"  # change this path
data = xlrd.open_workbook(path)
sheet_1_by_index = data.sheet_by_index(0)  # read the first sheet
title = sheet_1_by_index.col_values(1)  # second column
n_of_rows = sheet_1_by_index.nrows
doc_set = []  # empty list
for i in range(1, n_of_rows):  # read row by row, skipping the header row
    doc_set.append(title[i])

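# Load the stop-word list from a file
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords

stopwords = stopwordslist(r"D:\01研\01大四\2020.3.13-国家突发卫生事件\20201008\stopwords.txt")
texts = []     # keywords of each document
word_set = []  # keywords of each document, without repeats
set_word = []  # all distinct keywords across the documents
# Second, hand-made stop-word list: words I personally wanted removed
stpwrdlst2 = ['和', '等', '对', '的', '不', '与', '一', '化', '三要', '二要']

for doc in doc_set:
    # keep Chinese characters only
    cleaned_doc = ''.join(re.findall(r'[\u4e00-\u9fa5]', doc))
    # segment into words
    doc_cut = jieba.lcut(cleaned_doc)
    # remove stop words and single-character tokens
    text_list0 = [word for word in doc_cut if word not in stopwords and len(word) > 1]
    text_list1 = [word for word in text_list0 if word not in stpwrdlst2]
    # the final result for this document goes into texts
    texts.append(text_list1)

# These loops run once, after every document has been tokenized,
# so that each document appears in word_set exactly once
for word in texts:
    word_new = list(set(word))  # drop repeated words within one document
    word_set.append(word_new)
for subword in word_set:
    for word in subword:
        if word not in set_word:
            set_word.append(word)  # collect every word that appears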
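The code above stops at the per-document word lists; the matrix itself is in the linked post. As a rough sketch of that remaining step (not the linked author's code), the counting could look like the block below. It assumes document-level co-occurrence: two words co-occur once for every document that contains both, and the diagonal is left at 0. The names `build_cooccurrence` and `write_matrix_csv` are made up for this illustration, and it writes CSV with the standard library rather than a real Excel file.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def build_cooccurrence(word_set):
    """Return (vocab, matrix); matrix[i][j] = number of documents containing both words."""
    vocab = sorted({word for doc in word_set for word in doc})
    index = {word: i for i, word in enumerate(vocab)}
    pair_counts = Counter()
    for doc in word_set:
        # Deduplicate and sort so each unordered pair is counted once per document.
        for a, b in combinations(sorted(set(doc)), 2):
            pair_counts[(a, b)] += 1
    matrix = [[0] * len(vocab) for _ in vocab]
    for (a, b), count in pair_counts.items():
        matrix[index[a]][index[b]] = count
        matrix[index[b]][index[a]] = count  # keep the matrix symmetric
    return vocab, matrix


def write_matrix_csv(f, vocab, matrix):
    """Write the matrix with a header row and a label column to an open text file."""
    writer = csv.writer(f)
    writer.writerow([""] + vocab)
    for word, row in zip(vocab, matrix):
        writer.writerow([word] + row)


if __name__ == "__main__":
    # Toy data standing in for the word_set built above (hypothetical example).
    demo_word_set = [["疫情", "防控", "医院"], ["疫情", "防控"], ["医院", "物资"]]
    vocab, matrix = build_cooccurrence(demo_word_set)
    buffer = io.StringIO()
    write_matrix_csv(buffer, vocab, matrix)
    print(buffer.getvalue())
```

To write a real file, pass something like `open("cooccurrence.csv", "w", newline="", encoding="utf-8-sig")` instead of the `StringIO` buffer; the `utf-8-sig` encoding is a guess at what makes Excel display the Chinese labels correctly, so check it on your own machine.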

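Converting the Excel-stored co-occurrence matrix with UCINET

Save it in UCINET's .##d and .##h format.

There are too many keywords, so extract a subset by threshold and save it in .net format.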
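The post does the thresholding and the .net export inside UCINET. If you would rather stay in Python, here is a minimal sketch of the same idea: keep only the pairs whose co-occurrence count reaches a threshold, drop keywords left without any edge, and write a Pajek network file (`*Vertices` with 1-based ids and quoted labels, then `*Edges` with weights). The function name `write_pajek_net`, the default threshold of 2, and the toy matrix are all assumptions for illustration.

```python
import io


def write_pajek_net(f, vocab, matrix, threshold=2):
    """Write pairs with weight >= threshold as an undirected Pajek network."""
    n = len(vocab)
    # Upper triangle only: the matrix is symmetric, so each pair appears once.
    edges = [
        (i, j, matrix[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if matrix[i][j] >= threshold
    ]
    # Keep only the keywords that still have at least one edge.
    kept = sorted({k for i, j, _ in edges for k in (i, j)})
    new_id = {old: pos + 1 for pos, old in enumerate(kept)}  # Pajek ids start at 1
    f.write(f"*Vertices {len(kept)}\n")
    for old in kept:
        f.write(f'{new_id[old]} "{vocab[old]}"\n')
    f.write("*Edges\n")
    for i, j, weight in edges:
        f.write(f"{new_id[i]} {new_id[j]} {weight}\n")


if __name__ == "__main__":
    # Hypothetical 4-keyword co-occurrence matrix, just for the demo.
    demo_vocab = ["a", "b", "c", "d"]
    demo_matrix = [
        [0, 2, 1, 0],
        [2, 0, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    buffer = io.StringIO()
    write_pajek_net(buffer, demo_vocab, demo_matrix, threshold=2)
    print(buffer.getvalue())
```

Which text encoding Pajek needs for Chinese labels depends on the Pajek version, so treat the encoding you open the output file with as something to test rather than a given.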

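Visualization

Opened in NetDraw, the graph is not very attractive.

Open the .net file in Pajek.

If you need a more polished figure, add a .vec file, which you create yourself.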
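The post does not say what goes into the .vec file. One plausible choice, sketched here as an assumption, is a per-keyword weight such as the weighted degree (the sum of a keyword's co-occurrence counts), which Pajek can then use to size the vertices. A Pajek vector file is a `*Vertices n` line followed by one number per line, in the same vertex order as the .net file. The helper names `weighted_degree` and `write_pajek_vec` are invented for this example.

```python
import io


def weighted_degree(matrix):
    """Sum of co-occurrence weights per keyword (row sums of the matrix)."""
    return [sum(row) for row in matrix]


def write_pajek_vec(f, values):
    """Write one numeric value per vertex in Pajek vector (.vec) layout."""
    f.write(f"*Vertices {len(values)}\n")
    for value in values:
        f.write(f"{value}\n")


if __name__ == "__main__":
    # Hypothetical 3-keyword matrix; rows must follow the .net vertex order.
    demo_matrix = [
        [0, 2, 1],
        [2, 0, 0],
        [1, 0, 0],
    ]
    buffer = io.StringIO()
    write_pajek_vec(buffer, weighted_degree(demo_matrix))
    print(buffer.getvalue())
```

If the network was thresholded first, compute the values from the thresholded matrix, otherwise the vector and the network will not line up vertex for vertex.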
