Information Entropy of Chinese

Latest recommended article published on 2023-09-23 14:34:21

fly_guy

Category: NLP. Tags: nlp

Article link: https://blog.csdn.net/weixin_41099940/article/details/115731426


Data preprocessing

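Create the list f_names and use the glob function from the glob library to collect the paths of all the novel files.
Use a regular expression to extract the Chinese characters of each file into data.
Segment each novel into words with jieba.
Finally, gather all the segmented words in cleaned_data.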

import glob
from opencc import OpenCC

# Traditional-to-Simplified converter; created here but not used in the code shown in this post
opencc = OpenCC('t2s')

# Directory that holds the novels, one .txt file per book
path = 'xiaoshuo'

##########################################################################
# Getting file names (book titles)
##########################################################################
f_names = glob.glob(path + "/*.txt")
print(f_names)

['xiaoshuo\\inf.txt', 'xiaoshuo\\三十三剑客图.txt', 'xiaoshuo\\书剑恩仇录.txt', 'xiaoshuo\\侠客行.txt', 'xiaoshuo\\倚天屠龙记.txt', 'xiaoshuo\\天龙八部.txt', 'xiaoshuo\\射雕英雄传.txt', 'xiaoshuo\\白马啸西风.txt', 'xiaoshuo\\碧血剑.txt', 'xiaoshuo\\神雕侠侣.txt', 'xiaoshuo\\笑傲江湖.txt', 'xiaoshuo\\越女剑.txt', 'xiaoshuo\\连城诀.txt', 'xiaoshuo\\雪山飞狐.txt', 'xiaoshuo\\飞狐外传.txt', 'xiaoshuo\\鸳鸯刀.txt', 'xiaoshuo\\鹿鼎记.txt']

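import jieba
import re

cleaned_data = []
for f_name in f_names:
    # 'ANSI' selects the Windows system code page and is only available on Windows;
    # elsewhere an explicit codec is needed ('gb18030' would be a guess about these files)
    with open(f_name, 'r', encoding='ANSI') as file_object:
        text = file_object.read()
    # Keep only Chinese characters (U+4E00 to U+9FA5); punctuation, digits and Latin letters are dropped
    data = ''.join(re.findall(r'[\u4e00-\u9fa5]', text))
    # Segment the cleaned text into words and add them to the corpus-wide token list
    cut_data = jieba.lcut(data)
    cleaned_data.extend(cut_data)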
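
The effect of the regular-expression step is easiest to see on a short string. The sketch below uses only the standard library; the helper name keep_chinese and the sample sentence are made up for illustration and do not come from the corpus.

```python
import re

# Matches one character in the range U+4E00 to U+9FA5, the same range the post uses.
CHINESE_CHAR = re.compile(r'[\u4e00-\u9fa5]')

def keep_chinese(text):
    """Return text with every character outside the Chinese range removed."""
    return ''.join(CHINESE_CHAR.findall(text))

# Hypothetical sample mixing Chinese, punctuation, Latin letters and a digit.
sample = '郭靖道：“Hello, 蓉儿！” 第1章'
print(keep_chinese(sample))  # 郭靖道蓉儿第章
```

Because punctuation is removed before segmentation, sentence boundaries disappear and the n-grams counted later can span two sentences. That is probably why pairs such as ('道', '你') rank among the most frequent bigrams.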

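Computing information entropy

For a single random variable $X$, the information entropy is defined as
$H(X)=-\sum_{x \in X} P(x) \log P(x)$
The code below takes the logarithm to base 2, so entropy is measured in bits.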

import numpy as np

##########################################################################
# Define class Shannon, with one method for the surprisal of an outcome
# and one for the entropy of a list of probabilities (both in bits)
##########################################################################
class Shannon():

    ##########################################################################
    # This function computes the surprisal given the probability of the outcome
    ##########################################################################
    def compute_surprisal(self, p_xi):
        return np.log2(1 / p_xi)

    ##########################################################################
    # This function computes the entropy given a discrete probability dist.
    # (an array of probabilities, all strictly positive)
    ##########################################################################
    def compute_entropy(self, p_x):
        return -np.sum(p_x * np.log2(p_x))
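
As a quick sanity check of the definition, the entropy of two tiny distributions can be worked out by hand and compared with code. The sketch below is a standard-library stand-in for the same formula, not the Shannon class itself; the function name entropy_bits is introduced here for illustration.

```python
import math
from collections import Counter

def entropy_bits(tokens):
    """Shannon entropy, in bits, of the empirical distribution of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

fair = entropy_bits(['H', 'T'])               # two equally likely outcomes
skewed = entropy_bits(['a', 'a', 'a', 'b'])   # probabilities 3/4 and 1/4
print(fair)              # 1.0
print(round(skewed, 3))  # 0.811
```

A fair coin needs exactly one bit per outcome, while the skewed distribution needs less because one outcome is more predictable.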

N-gram language model

To compute the information entropy $H(S)$ of a corpus $S$, we need the probability distribution $P(S)$ of the word sequences in the corpus. For a sentence $s_{i}=\{w_{1}, w_{2}, \ldots, w_{K_{i}}\}$ in the corpus, the probability of the sentence is the product of the conditional probabilities of its words appearing in order:
$P\left(s_{i}\right)=P\left(w_{1}\right) \cdot P\left(w_{2} \mid w_{1}\right) \cdots P\left(w_{K_{i}} \mid w_{1}, w_{2}, \ldots, w_{K_{i}-1}\right)$

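However, sentences differ in length, and the number of parameters in this model grows with sentence length until the computation becomes impractical. In practice, language models are therefore computed and analysed with the N-gram model: for each word, the conditional probability takes only the preceding N-1 words into account: $P\left(w_{i} \mid w_{1}, w_{2}, \ldots, w_{i-1}\right) \approx P\left(w_{i} \mid w_{i-N+1}, w_{i-N+2}, \ldots, w_{i-1}\right)$

To keep the computation simple, the models used most often are the unigram model (N=1), the bigram model (N=2), and the trigram model (N=3).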
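
To make the N-gram idea concrete, here is a minimal sketch that extracts n-grams from a token list and estimates a bigram conditional probability from counts. It uses only the standard library; the function names and the toy token list are assumptions made for illustration, and the count-ratio estimate is the plain maximum-likelihood one, with no smoothing.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of consecutive n-token tuples in tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_prob(tokens, prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(bigrams starting with prev)."""
    pair_counts = Counter(ngrams(tokens, 2))
    context_total = sum(c for (first, _), c in pair_counts.items() if first == prev)
    return pair_counts[(prev, word)] / context_total if context_total else 0.0

# Hypothetical, already segmented text.
tokens = ['他', '笑', '道', '你', '好', '他', '笑', '了']
print(ngrams(tokens, 2)[:3])           # [('他', '笑'), ('笑', '道'), ('道', '你')]
print(bigram_prob(tokens, '笑', '道'))  # 0.5
```

In the toy text, '笑' is followed once by '道' and once by '了', so each continuation gets probability 0.5, while '他' is always followed by '笑'.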

import numpy as np
import nltk

##########################################################################
# This computes F_n given the token list and the size of the n-grams:
# the entropy of the n-gram distribution minus the entropy of the
# (n-1)-gram distribution (for n = 1, the plain unigram entropy)
##########################################################################
def compute_Fn(text, n):
    shannon = Shannon()
    if n == 1:
        # Relative frequency of each unigram
        fdist = nltk.FreqDist(nltk.ngrams(text, n))
        counts = np.array(list(fdist.values()))
        return shannon.compute_entropy(counts / counts.sum())
    else:
        # Relative frequency of each (n-1)-gram
        fdist1 = nltk.FreqDist(nltk.ngrams(text, n - 1))
        counts1 = np.array(list(fdist1.values()))
        p1 = counts1 / counts1.sum()
        # Relative frequency of each n-gram
        fdist2 = nltk.FreqDist(nltk.ngrams(text, n))
        counts2 = np.array(list(fdist2.values()))
        p2 = counts2 / counts2.sum()
        # Show the ten most frequent n-grams with their counts
        print(dict(fdist2.most_common(10)))
        return shannon.compute_entropy(p2) - shannon.compute_entropy(p1)
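
For n > 1, compute_Fn returns the difference between two joint entropies. The short derivation below is added as a reading aid and is not taken from the post; it shows why that difference can be read as a conditional entropy. The symbols $W_{1}, \ldots, W_{n}$ are notation introduced for this note and stand for n consecutive tokens. The identity assumes that the (n-1)-gram distribution is the marginal of the n-gram distribution, which holds only approximately for counts taken from a finite text.

$F_{n}=H\left(W_{1}, \ldots, W_{n}\right)-H\left(W_{1}, \ldots, W_{n-1}\right)$
$=H\left(W_{n} \mid W_{1}, \ldots, W_{n-1}\right)$
$=-\sum_{w_{1}, \ldots, w_{n}} P\left(w_{1}, \ldots, w_{n}\right) \log _{2} P\left(w_{n} \mid w_{1}, \ldots, w_{n-1}\right)$

Under that reading, $F_{n}$ measures how many bits the next word still carries once the preceding n-1 words are known.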

compute_Fn(cleaned_data, 1)

12.178976256444095

compute_Fn(cleaned_data, 2)

{('道', '你'): 5825, ('叫', '道'): 5033, ('道', '我'): 5012, ('笑', '道'): 4266, ('听', '得'): 4218, ('都', '是'): 3923, ('了', '他'): 3784, ('他', '的'): 3509, ('也', '是'): 3212, ('的', '一声'): 3127}

6.950433116176587

compute_Fn(cleaned_data, 3)

{('只', '听', '得'): 1615, ('忽', '听', '得'): 1138, ('站', '起身', '来'): 733, ('哼', '了', '一声'): 581, ('笑', '道', '你'): 576, ('吃', '了', '一惊'): 539, ('啊', '的', '一声'): 525, ('点', '了', '点头'): 505, ('说', '到', '这里'): 476, ('了', '他', '的'): 461}

2.299638729005366

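Experimental results

This experiment applied three language models (unigram, bigram, and trigram) to the same corpus. The resulting entropy of Chinese is 12.179 bits per word under the unigram model, 6.950 under the bigram model, and 2.300 under the trigram model.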
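
Since compute_Fn returns a difference of joint entropies for n > 1, the joint entropies themselves can be recovered by adding the reported values back up. The sketch below does only that arithmetic on the three numbers printed above; the variable names are chosen here for illustration.

```python
# F_n values printed by compute_Fn in this post, in bits.
F = {1: 12.178976256444095, 2: 6.950433116176587, 3: 2.299638729005366}

# The joint n-gram entropy is the running sum F_1 + ... + F_n.
joint = {}
running = 0.0
for n in sorted(F):
    running += F[n]
    joint[n] = running

print({n: round(h, 3) for n, h in joint.items()})
# {1: 12.179, 2: 19.129, 3: 21.429}
```

The entropy of an empirical distribution cannot exceed the base-2 logarithm of the number of observations, so it may be worth comparing the trigram value of about 21.4 bits with the base-2 logarithm of the corpus token count. If the two are close, most trigrams occur only once, and the low trigram figure would then reflect data sparsity as much as the predictability of the language. This is an inference from the formula, not something measured in the post.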

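Comparing the models with one another, and Chinese with English, leads to two conclusions:

In an N-gram model, the larger N is, the more context each word is conditioned on, the simpler the distribution of the next word becomes, and the lower the entropy of the text.
Chinese has a higher information entropy than English and needs more bits to encode. Information entropy can be seen as a measure of the information needed to remove uncertainty, that is, the amount of information an unknown event may carry. The higher the entropy, the more information is carried and the greater the uncertainty.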

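Entropy as a measure of information is useful in many settings, such as evaluating compression algorithms and choosing split rules in decision trees, and it is widely used in machine learning algorithms.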
