一、Introduction
Mutual information measures how strongly two variables X and Y depend on each other. For discrete variables it is defined as

$$I(X;Y)=\sum_{y\in Y}\sum_{x\in X}p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}$$

For continuous variables it is defined as

$$I(X;Y)=\int_{Y}\int_{X}p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy$$
Here p(x,y) is the joint probability distribution and p(x), p(y) are the marginal distributions. The log comes from information theory: taking the log of a probability converts it into an amount of information (multiplied by -1 to make it positive). With base 2, it can be read as the number of bits needed to encode the outcome.
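For the discrete case, the sum can be computed directly. The sketch below uses an invented 2×2 joint distribution of two binary variables (all numbers are hypothetical, chosen only for illustration):

```python
import math

# Hypothetical joint distribution p(x, y) of two binary variables (invented numbers)
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) and p(y) are obtained by summing out the other variable
p_x = {x: sum(p for (xi, _), p in p_xy.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in p_xy.items() if yi == y) for y in (0, 1)}

# I(X;Y) = sum over x, y of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(round(mi, 4))  # about 0.2781 bits: X and Y are weakly correlated
```

If X and Y were independent, every term's ratio p(x,y)/(p(x)p(y)) would be 1 and the mutual information would be exactly 0.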
二、Mutual information vs. conditional entropy and joint entropy
Entropy is defined as

$$H(X)=-\sum_{x\in X}p(x)\log p(x)$$

H(X) is the average information content of X, i.e. a measure of its uncertainty (the lower the probability of an event, the greater its uncertainty).

H(X|Y) = H(X,Y) - H(Y)

I(X;Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)
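These identities can be checked numerically on a small joint distribution (the 2×2 table here is invented purely for illustration):

```python
import math

# Hypothetical 2x2 joint distribution p(x, y) (invented for illustration)
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {0: 0.5, 1: 0.5}  # marginal p(x)
p_y = {0: 0.5, 1: 0.5}  # marginal p(y)

def entropy(dist):
    # H = -sum p * log2(p), skipping zero-probability outcomes
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

H_x, H_y, H_xy = entropy(p_x), entropy(p_y), entropy(p_xy)

H_x_given_y = H_xy - H_y   # H(X|Y) = H(X,Y) - H(Y)
H_y_given_x = H_xy - H_x   # H(Y|X) = H(X,Y) - H(X)
mi = H_y - H_y_given_x     # I(X;Y) = H(Y) - H(Y|X)

# The symmetric form gives the same value: I(X;Y) = H(X) - H(X|Y)
assert abs(mi - (H_x - H_x_given_y)) < 1e-12
print(round(mi, 4))
```

Note that I(X;Y) is symmetric in X and Y, while the conditional entropies H(X|Y) and H(Y|X) generally are not.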
The specific relationships among these quantities are shown in the figure below:
三、Word collocation with mutual information
Mutual information can be used to score the association between the two words of a bigram: the higher the mutual information, the stronger the correlation between X and Y, and the more likely they form a phrase.
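Concretely, for a word pair this is the pointwise mutual information PMI(x, y) = log2( p(x,y) / (p(x)p(y)) ), with each probability estimated from corpus counts. A minimal sketch, with all counts invented for illustration:

```python
import math

# Hypothetical corpus counts (invented numbers for illustration)
N = 10000          # total number of tokens in the corpus
c_x, c_y = 50, 40  # unigram counts of words x and y
c_xy = 20          # number of times x and y co-occur

# PMI(x, y) = log2( p(x,y) / (p(x) * p(y)) ), estimating each p by count / N
pmi = math.log2((c_xy / N) / ((c_x / N) * (c_y / N)))
print(round(pmi, 4))  # log2(100) ~ 6.6439: a strong collocation signal
```

A PMI well above 0 means the pair co-occurs far more often than chance would predict; note that PMI is unreliable for very low counts, since a single rare co-occurrence can produce a large score.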
Word collocations play an important role in building collocation knowledge bases. Their main uses are as follows:
1、Phrase recommendation
2、For word meaning: using collocation strength as the weights of a word-word matrix to compute the similarity between two words
3、Given a historical corpus, diachronic collocations can be used to track semantic change in the vocabulary
Algorithm outline:
1、Build the corpus
2、Build the word-frequency dictionary
3、Build the co-occurrence matrix
4、Compute mutual information
5、Save the results
四、Code
```python
# Word collocation extraction based on mutual information
import collections
import math
import jieba.posseg as pseg

window_size = 5
file_path = "D:/workspace/project/NLPcase/wordsCollocation/data/data.txt"
# Output path: must differ from the input file, or the corpus gets overwritten
mi_path = "D:/workspace/project/NLPcase/wordsCollocation/data/mi.txt"

# Build the corpus: read the raw file and segment each line
def build_corpus():
    def cut_words(sent):
        # keep only content words: drop punctuation/symbols (x, w) and
        # prepositions, particles, conjunctions (p, u, c)
        return [word.word for word in pseg.cut(sent)
                if word.flag[0] not in ['x', 'w', 'p', 'u', 'c']]
    with open(file_path, encoding='utf-8') as f:
        sents = [cut_words(sent) for sent in f.read().split('\n')]
    return sents

# Count word frequencies; also return the total token count for probability estimates
def count_words(sents):
    words_all = []
    for sent in sents:
        words_all.extend(sent)
    word_dict = dict(collections.Counter(words_all))
    return word_dict, len(words_all)

# For every token, collect the words inside a +/- window_size context window;
# each record stores the context words followed by the centre word
def build_cowords(sents):
    train_data = []
    for sent in sents:
        for index, word in enumerate(sent):
            left = sent[max(index - window_size, 0):index]
            right = sent[index + 1:index + 1 + window_size]
            data = [w for w in left + right if w] + [word]
            train_data.append(data)
    return train_data

# Build the co-occurrence record: for each centre word, an '@'-joined
# string of all context words observed with it
def count_cowords(train_data):
    co_dict = dict()
    for data in train_data:
        word = data[-1]              # the centre word is stored last
        for co_word in data[:-1]:
            if word not in co_dict:
                co_dict[word] = co_word
            else:
                co_dict[word] += '@' + co_word
    return co_dict

# Compute mutual information between each word and its co-occurring words
def compute_words_mi(word_dict, co_dict, sum_tf):
    def compute_mi(p1, p2, p12):
        # pointwise mutual information: log2( p12 / (p1 * p2) )
        return math.log2(p12) - math.log2(p1) - math.log2(p2)
    mi_dict = dict()
    for word, co_words in co_dict.items():
        co_word_dict = dict(collections.Counter(co_words.split('@')))
        mis_dict = {}
        for co_word, co_tf in co_word_dict.items():
            if co_word == word:
                continue
            p1 = word_dict[word] / sum_tf
            p2 = word_dict[co_word] / sum_tf
            p12 = co_tf / sum_tf
            mis_dict[co_word] = compute_mi(p1, p2, p12)
        mi_dict[word] = sorted(mis_dict.items(), key=lambda kv: kv[1], reverse=True)
    return mi_dict

# Save the mutual-information table: one word per line with its scored collocates
def save_mi(mi_dict):
    with open(mi_path, 'w+', encoding='utf-8') as f:
        for word, co_words in mi_dict.items():
            con_infos = [item[0] + '@' + str(item[1]) for item in co_words]
            f.write(word + '\t' + ','.join(con_infos) + '\n')

# Main pipeline: corpus -> frequencies -> co-occurrence -> mutual information -> save
def mi_main():
    sents = build_corpus()
    word_dict, sum_tf = count_words(sents)
    train_data = build_cowords(sents)
    co_dict = count_cowords(train_data)  # was build_cowords(train_data): wrong function
    mi_dict = compute_words_mi(word_dict, co_dict, sum_tf)
    save_mi(mi_dict)

if __name__ == '__main__':
    mi_main()
```
五、References
https://blog.csdn.net/BigData_Mining/article/details/81279612
https://blog.csdn.net/ranghanqiao5058/article/details/78458815
https://blog.csdn.net/qq_15111861/article/details/80724278
https://github.com/liuhuanyong/WordCollocation