Amount of Information - The Formula for Entropy - KL Divergence - Cross Entropy

Amount of Information

When we read papers or news, why isn't every article equally important to us? Some are greatly enlightening, while others are like an old lady's foot-binding cloth: long and smelly, full of nonsense. The reason is that their amount of information is different! But how do we measure that?

It is easy to see that:
The amount of information depends only on the probability of the event.

The smaller the probability of an event, the greater the amount of information it carries, like Yasuo getting a penta kill on your team;
The larger the probability, the smaller the amount of information. For example, "my first blog post will be read by many people" (it is bound to happen, so it carries no information).

The Formula for Entropy

Let's derive the formula for entropy together:
Facing a new concept, we should first look at its mathematical properties. The key one here is additivity, which means

F(event A and B and C…) = F(event A) + F(event B) + F(event C) + …

and, as discussed above, the amount of information depends only on probability, so:

F(p(A, B, C…)) = F(p(A)) + F(p(B)) + F(p(C)) + …

As we all know, if events A and B are independent, then p(A, B) = p(A) * p(B), and the logarithm is exactly the function that turns such a product into a sum. In addition, the amount of information should not be negative; since p(A) ≤ 1, we have log(p(A)) ≤ 0, so we flip the sign. We can design:

F(A) = -log(p(A))

As for the choice of base, it is simply the benchmark we pick. Just as 0° and 100° are defined as the melting point of ice and the boiling point of water at standard pressure, we could in theory define 0° and 100° based on alcohol without causing any problem.
In information theory, the base is generally 2, which can be understood as taking "tossing a coin and getting heads" as the benchmark: F(tossing a coin with heads up) = -log2(p(tossing a coin with heads up)) = -log2(0.5) = 1 bit. Why 2? Because computers are binary, so transmitted information is easy to count in base 2.
In neural networks, e is generally used as the base, for no reason other than that the derivative is easier to compute for gradient descent.
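
To make this concrete, here is a minimal Python sketch (the helper name info_amount and the example probabilities are my own, not from the original post) that computes the amount of information in a chosen base and checks the additivity property for independent events:

```python
import math

def info_amount(p, base=2):
    """Amount of information of an event with probability p.

    base=2 gives bits (information-theory convention);
    base=math.e gives nats (common in neural networks).
    """
    return -math.log(p, base)

# A rare event carries more information than a likely one.
print(info_amount(0.5))    # fair coin flip: 1.0 bit
print(info_amount(0.001))  # very unlikely event: ~9.97 bits

# Additivity: for independent events p(A, B) = p(A) * p(B),
# so F(p(A) * p(B)) = F(p(A)) + F(p(B)).
pA, pB = 0.5, 0.25
assert math.isclose(info_amount(pA * pB), info_amount(pA) + info_amount(pB))
```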

Let's get down to business. Entropy is defined as the degree of uncertainty of information; mathematically, it equals the expectation of the amount of information, so we get:
H(X) = E[-log(p(x))] = -Σ_x p(x) * log(p(x))
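
As a quick sanity check, here is a small sketch (the function name entropy and the example distributions are my own assumptions) that evaluates this expectation for a fair coin and a heavily biased coin; the fair coin is the most uncertain and therefore has the highest entropy:

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum_x p(x) * log(p(x)), the expected amount of information."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit (maximum uncertainty)
print(entropy([0.99, 0.01]))  # biased coin: ~0.08 bits (almost certain outcome)
```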

Relative Entropy (KL Divergence)

KL divergence is used to measure how different two probability distributions are. It is often loosely called a "distance", although it is not a true distance because it is not symmetric.

D_KL(p || q) = Σ_x p(x) * log( p(x) / q(x) )
In my understanding, KL divergence is like measuring q from p's point of view: the expected extra information when we use q to describe samples that actually come from p. Maybe it is clearer in this form:
D_KL(p || q) = Σ_x p(x) * [ (-log(q(x))) - (-log(p(x))) ]
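
Here is a minimal numeric sketch of the formula above (the names kl_divergence, p and q are mine); it also shows that the measure is not symmetric, which is why it is not a true distance:

```python
import math

def kl_divergence(p, q, base=2):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # distribution the samples actually come from
q = [0.9, 0.1]  # distribution we use to describe them
print(kl_divergence(p, q))  # ~0.74 bits
print(kl_divergence(q, p))  # ~0.53 bits -- not symmetric, so not a metric
```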

Cross Entropy

From KL Divergence:
D_KL(p || q) = Σ_x p(x) * log(p(x)) + Σ_x p(x) * (-log(q(x)))
The second part is called Cross Entropy:
H(p, q) = -Σ_x p(x) * log(q(x))
It is widely used as a loss function, because p(x) usually stands for the real distribution of the data, which does not change during training. The first part above, Σ_x p(x) * log(p(x)) = -H(p), is therefore a constant, and the second part, the cross entropy, is what we actually want to minimize.
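
Below is a small sketch under the same assumptions (helper names are mine) that checks the identity D_KL(p || q) = H(p, q) - H(p): because H(p) is fixed by the data, minimizing the cross entropy during training is equivalent to minimizing the KL divergence.

```python
import math

def entropy(p, base=2):
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def cross_entropy(p, q, base=2):
    """H(p, q) = -sum_x p(x) * log(q(x))."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q, base=2):
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [1.0, 0.0, 0.0]  # one-hot "real" label distribution, fixed during training
q = [0.7, 0.2, 0.1]  # model's predicted distribution
print(cross_entropy(p, q))  # ~0.51 bits, the familiar cross-entropy loss
# Since H(p) is constant, minimizing H(p, q) minimizes D_KL(p || q).
assert math.isclose(kl_divergence(p, q), cross_entropy(p, q) - entropy(p))
```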

This post is mainly to make sure I understand these concepts myself: if I can explain them in English, then I really do understand them. Corrections on content or grammar are welcome!
