My Take on Building a "Co-word Matrix"

野牛首领

Last modified 2022-04-19 23:02:17


Tags: python

First published 2022-04-19 22:56:43

Article link: https://blog.csdn.net/weixin_44975174/article/details/124285794


Building a "Co-word Matrix": A Minimal Version

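For a research project, I first segment the text into words and then build a "co-word matrix", in which each entry counts the items (rows of segmented words) that contain both of two given words. My real input is the segmentation output, which is fairly large, so here I make up a few simple rows of data to illustrate the idea.
Here is the code: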

import pandas as pd
# If the data lives in an Excel file, or comes from segmentation output stored in some other form, convert it to this format first
data = [['我们', '他们', '好兄弟', '好朋友'],
        ['我们', '你们', '好朋友', '好同学'],
        ['他们', '我们', '好兄弟', '好朋友'],
        ['我们', '他们', '好朋友'],
        ['他们', '你们', '好兄弟', '好朋友', '好同学'],
        ['吃饭', '休息', '学习'],
        ['吃饭', '打球', '看球', '学习'],
        ['打球', '学习', '看球', '写代码'],
        ['休息', '吃饭', '学习'],
        ['休息', '学习', '写代码']
       ]
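# 1. Collect all distinct words
# Note: set order can vary between runs, so the row/column order may differ from the output shown later
all_words = list(set([word for item in data for word in item]))
# 2. Find where each word appears
appear_dict = {}       # maps each word to the indices of the items that contain it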
for word in all_words:
    appear = []
    for i, item in enumerate(data):
        if word in item:
            appear.append(i)
    appear_dict[word] = appear
# print(appear_dict)
# 3. Initialize the matrix with zeros
matrix = [[0 for j in range(len(all_words))] for i in range(len(all_words))]
# 4. Fill in the co-word matrix
for row in range(len(matrix)):
    print(f'Processing row {row}: ', end='')
    for col in range(len(matrix)):
        print(f'col {col}...', end='')
        if col == row:
            # Diagonal: the number of items in which the word itself appears.
            # Assign 0 here instead if that count is not needed.
            word = all_words[col]
            matrix[col][row] = len(appear_dict[word])
        elif col > row:
            # Number of items that contain both words
            counter = len(set(appear_dict[all_words[row]]) & set(appear_dict[all_words[col]]))
            matrix[col][row] = counter
        else:
            # The matrix is symmetric: reuse the value computed in an earlier row
            matrix[col][row] = matrix[row][col]
    print('done!')
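# 5. Print the result
for item in matrix:
    print(item)
# 6. Write the result to an Excel file (writing a CSV file or processing it some other way also works, for easier inspection)
result_dict = dict(zip(all_words, matrix))
df = pd.DataFrame(result_dict, index=all_words)
df.to_excel(r'./matrix.xlsx')   # writing .xlsx requires an Excel engine such as openpyxl to be installed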
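
When the vocabulary gets large, comparing every pair of words becomes slow. The same counts can probably be collected in a single pass over the data by counting word pairs inside each item. The following is only a sketch of that alternative, not the code used above: the helper names `cooccurrence_counts` and `to_matrix` and the tiny `sample` data are made up for illustration, and the input is assumed to have the same list-of-lists shape as `data`.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(items):
    """Count word frequencies and, for each unordered word pair,
    the number of items that contain both words."""
    word_counts = Counter()
    pair_counts = Counter()
    for item in items:
        unique_words = sorted(set(item))   # ignore repeats inside one item
        word_counts.update(unique_words)
        # sorted input guarantees each pair is always stored as (smaller, larger)
        pair_counts.update(combinations(unique_words, 2))
    return word_counts, pair_counts


def to_matrix(words, word_counts, pair_counts):
    """Build a symmetric matrix with word frequencies on the diagonal."""
    index = {w: i for i, w in enumerate(words)}
    size = len(words)
    result = [[0] * size for _ in range(size)]
    for w, n in word_counts.items():
        result[index[w]][index[w]] = n
    for (a, b), n in pair_counts.items():
        result[index[a]][index[b]] = n
        result[index[b]][index[a]] = n
    return result


# Hypothetical sample data, just to show the shape of the result
sample = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c']]
word_counts, pair_counts = cooccurrence_counts(sample)
words = sorted(word_counts)               # fixed, reproducible word order
sketch_matrix = to_matrix(words, word_counts, pair_counts)
for row in sketch_matrix:
    print(row)
```

Sorting the words gives a stable row and column order between runs, which `list(set(...))` does not guarantee.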

Output (progress messages omitted):
[3, 1, 1, 0, 0, 2, 0, 0, 0, 0, 3, 0]
[1, 2, 2, 0, 0, 0, 1, 0, 0, 0, 2, 0]
[1, 2, 2, 0, 0, 0, 1, 0, 0, 0, 2, 0]
[0, 0, 0, 2, 1, 0, 0, 1, 2, 2, 0, 1]
[0, 0, 0, 1, 3, 0, 0, 2, 3, 1, 0, 3]
[2, 0, 0, 0, 0, 3, 1, 0, 0, 0, 3, 0]
[0, 1, 1, 0, 0, 1, 2, 0, 0, 0, 2, 0]
[0, 0, 0, 1, 2, 0, 0, 4, 4, 1, 0, 3]
[0, 0, 0, 2, 3, 0, 0, 4, 5, 2, 0, 4]
[0, 0, 0, 2, 1, 0, 0, 1, 2, 2, 0, 1]
[3, 2, 2, 0, 0, 3, 2, 0, 0, 0, 5, 0]
[0, 0, 0, 1, 3, 0, 0, 3, 4, 1, 0, 4]
Here the diagonal holds the number of items in which each word appears. If that value is not needed, just set the diagonal to 0.
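
If the matrix has already been computed, the diagonal can also be cleared afterwards instead of changing the loop. A minimal sketch, assuming a square list-of-lists matrix; `zero_diagonal` is a hypothetical helper name and the small `example` matrix is illustrative only:

```python
def zero_diagonal(square):
    """Return a copy of a square matrix with every diagonal entry set to 0."""
    return [[0 if i == j else value for j, value in enumerate(row)]
            for i, row in enumerate(square)]


# Illustrative 3x3 matrix; the original list is left unchanged
example = [[3, 1, 1],
           [1, 2, 2],
           [1, 2, 2]]
cleared = zero_diagonal(example)
for row in cleared:
    print(row)
```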
Reference: an article by 兔子爱读书