Information Theory is a branch of probability theory and mathematical statistics. It is applied to information processing, information entropy, communication systems, data transmission, rate-distortion theory, cryptography, signal-to-noise ratio, data compression, and related topics.
Basic Concepts

One note up front: in information theory, log is taken to base 2 by default.
Self-information

$$I(x_i) = -\log p(x_i) \qquad \cdots\cdots (1)$$
Joint self-information

$$I(x_i, y_j) = -\log p(x_i, y_j) \qquad \cdots\cdots (2)$$
Conditional self-information

$$I(x_i|y_j) = -\log p(x_i|y_j) \qquad \cdots\cdots (3)$$
Information entropy

$$H(X) = -\sum_i p(x_i)\log p(x_i) \qquad \cdots\cdots (4)$$
Conditional entropy

$$H(X|Y) = -\sum_i\sum_j p(x_i,y_j)\log p(x_i|y_j) = \sum_i\sum_j p(x_i,y_j) I(x_i|y_j) \qquad \cdots\cdots (5)$$
Joint entropy

$$H(X,Y) = -\sum_i\sum_j p(x_i,y_j)\log p(x_i,y_j) = \sum_i\sum_j p(x_i,y_j) I(x_i,y_j) \qquad \cdots\cdots (6)$$
By the chain rule,

$$H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) \qquad \cdots\cdots (a)$$
It follows that

$$H(X) - H(X|Y) = H(Y) - H(Y|X) \qquad \cdots\cdots (b)$$
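As a numerical sanity check, the chain rule (a) can be verified on a small joint distribution; the probabilities below are made up purely for illustration:

```python
import math

# A made-up joint distribution p(x, y) over X = {0, 1}, Y = {0, 1};
# the numbers are arbitrary, chosen only to check identity (a)
p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def H(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginals p(x) and p(y)
p_x = {x: sum(p for (x2, _), p in p_xy.items() if x2 == x) for x in (0, 1)}
p_y = {y: sum(p for (_, y2), p in p_xy.items() if y2 == y) for y in (0, 1)}

# Conditional entropies from definition (5): H(X|Y) = -sum p(x,y) log p(x|y)
h_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items())
h_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items())

# Chain rule (a): H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
print(abs(H(p_xy) - (H(p_x) + h_y_given_x)) < 1e-12)  # True
print(abs(H(p_xy) - (H(p_y) + h_x_given_y)) < 1e-12)  # True
```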
Information Gain

The entropy of the system is originally H(X); once the condition Y is known, the entropy of the system (the conditional entropy) is H(X|Y). The information gain is the difference between the two:

$$IG = H(X) - H(X|Y) \qquad \cdots\cdots (7)$$

Entropy measures the uncertainty of a system, so the larger the information gain, the more the condition Y contributes to pinning down the system's state.
Information gain in feature selection

Equation (7) directly gives the information gain of a term w: in (7), X is the set of classes C, and Y covers the two cases "w present" and "w absent":

$$IG(w) = H(C) - H(C|w)$$

$$= -\sum_i p(c_i)\log p(c_i) + \sum_i p(c_i,w)\log p(c_i|w) + \sum_i p(c_i,\overline{w})\log p(c_i|\overline{w})$$

$$= -\sum_i p(c_i)\log p(c_i) + p(w)\sum_i p(c_i|w)\log p(c_i|w) + p(\overline{w})\sum_i p(c_i|\overline{w})\log p(c_i|\overline{w}) \qquad \cdots\cdots (8)$$

Here $p(c_i)$ is the probability that a document belongs to class $i$; $p(w)$ is the fraction of documents in the whole training set that contain $w$; $p(c_i|w)$ is the fraction of the documents containing $w$ that belong to class $i$; and $p(c_i|\overline{w})$ is the fraction of the documents not containing $w$ that belong to class $i$.
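As an illustration, equation (8) can be evaluated on a tiny hand-made corpus. The `docs` format below (a list of `(class, set_of_terms)` pairs) and the example terms are invented for this sketch:

```python
import math

# Tiny invented corpus: (class label, set of terms in the document)
docs = [
    ("spam", {"buy"}),
    ("spam", {"buy", "hi"}),
    ("ham", {"hi"}),
    ("ham", {"hi"}),
]

def class_entropy(labels):
    """H over the class distribution of `labels` (log base 2)."""
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / len(labels)
        h -= p * math.log2(p)
    return h

def information_gain(term):
    """IG(w) = H(C) - H(C|w), equations (7)/(8)."""
    with_w = [c for c, terms in docs if term in terms]
    without_w = [c for c, terms in docs if term not in terms]
    p_w = len(with_w) / len(docs)
    h_c = class_entropy([c for c, _ in docs])
    # H(C|w) averages the entropies of the "w present"/"w absent" splits
    h_c_given_w = (p_w * class_entropy(with_w)
                   + (1 - p_w) * class_entropy(without_w))
    return h_c - h_c_given_w

print(information_gain("buy"))  # "buy" perfectly separates the classes -> 1.0
print(information_gain("hi"))   # "hi" is a weaker predictor
```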
Information gain in decision trees
| outlook | temperature | humidity | windy | play |
|----------|-------------|----------|-------|------|
| sunny | hot | high | FALSE | no |
| sunny | hot | high | TRUE | no |
| overcast | hot | high | FALSE | yes |
| rainy | mild | high | FALSE | yes |
| rainy | cool | normal | FALSE | yes |
| rainy | cool | normal | TRUE | no |
| overcast | cool | normal | TRUE | yes |
| sunny | mild | high | FALSE | no |
| sunny | cool | normal | FALSE | yes |
| rainy | mild | normal | FALSE | yes |
| sunny | mild | normal | TRUE | yes |
| overcast | mild | high | TRUE | yes |
| overcast | hot | normal | FALSE | yes |
| rainy | mild | high | TRUE | no |
Here X in equation (7) stands for the two outcomes, play and don't play.

Looking only at the last column: the probability of playing is 9/14 and the probability of not playing is 5/14, so with no prior information the entropy (uncertainty) of the system is

$$H(X) = -\frac{9}{14}\log\frac{9}{14} - \frac{5}{14}\log\frac{5}{14} = 0.94$$
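This value is easy to reproduce:

```python
import math

# Entropy of the play/don't-play split: 9 "yes" vs 5 "no" out of 14 rows
p_yes, p_no = 9 / 14, 5 / 14
h_x = -p_yes * math.log2(p_yes) - p_no * math.log2(p_no)
print(round(h_x, 2))  # 0.94
```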
Counting play = yes / play = no for each attribute value:

| attribute | value | yes | no |
|-------------|----------|-----|----|
| outlook | sunny | 2 | 3 |
| outlook | overcast | 4 | 0 |
| outlook | rainy | 3 | 2 |
| temperature | hot | 2 | 2 |
| temperature | mild | 4 | 2 |
| temperature | cool | 3 | 1 |
| humidity | high | 3 | 4 |
| humidity | normal | 6 | 1 |
| windy | FALSE | 6 | 2 |
| windy | TRUE | 3 | 3 |
| play (total) | | 9 | 5 |
If outlook is chosen as the root of the decision tree, Y in equation (7) is the set {sunny, overcast, rainy}, and the conditional entropy is

$$H(X|Y) = -p(sunny,yes)\log p(yes|sunny) - p(sunny,no)\log p(no|sunny)$$

$$- p(overcast,yes)\log p(yes|overcast) - p(overcast,no)\log p(no|overcast)$$

$$- p(rainy,yes)\log p(yes|rainy) - p(rainy,no)\log p(no|rainy)$$

$$= -p(sunny)\left[p(yes|sunny)\log p(yes|sunny) + p(no|sunny)\log p(no|sunny)\right]$$

$$- p(overcast)\left[p(yes|overcast)\log p(yes|overcast) + p(no|overcast)\log p(no|overcast)\right]$$

$$- p(rainy)\left[p(yes|rainy)\log p(yes|rainy) + p(no|rainy)\log p(no|rainy)\right]$$

$$= -\frac{5}{14}\left[\frac{2}{5}\log\frac{2}{5} + \frac{3}{5}\log\frac{3}{5}\right] - \frac{4}{14}\left[\frac{4}{4}\log\frac{4}{4} + 0\log 0\right] - \frac{5}{14}\left[\frac{3}{5}\log\frac{3}{5} + \frac{2}{5}\log\frac{2}{5}\right] = 0.693$$
That is, choosing outlook as the root of the decision tree gives an information gain of 0.94 - 0.693 = 0.247.

The information gain for temperature, humidity, and windy as the root is computed in the same way, and the attribute with the largest IG becomes the root.
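The whole attribute-selection step can be scripted. A minimal sketch over the 14-row table above:

```python
import math

# The 14 rows of the weather table: (outlook, temperature, humidity, windy, play)
data = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]
FEATURES = ["outlook", "temperature", "humidity", "windy"]

def entropy(labels):
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / len(labels)
        h -= p * math.log2(p)
    return h

def info_gain(col):
    """IG of splitting on column index `col`, per equation (7)."""
    labels = [row[-1] for row in data]
    h_cond = 0.0
    for v in {row[col] for row in data}:
        subset = [row[-1] for row in data if row[col] == v]
        h_cond += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - h_cond

for i, name in enumerate(FEATURES):
    print(name, round(info_gain(i), 3))
# outlook has the largest gain (0.94 - 0.693 = 0.247), so it becomes the root
```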
Mutual Information

The mutual information of $y_j$ about $x_i$ is defined as the log of the ratio of the posterior probability to the prior probability:

$$I(x_i;y_j) = \log\frac{p(x_i|y_j)}{p(x_i)} = I(x_i) - I(x_i|y_j) \qquad \cdots\cdots (9)$$

The larger the mutual information, the more $y_j$ contributes to determining the value of $x_i$.
The average mutual information of the system is

$$I(X;Y) = \sum_i\sum_j p(x_i,y_j) I(x_i;y_j) = \sum_i\sum_j p(x_i,y_j)\log\frac{p(x_i|y_j)}{p(x_i)}$$

$$= H(X) - H(X|Y) = H(Y) - H(Y|X) \qquad \cdots\cdots (10)$$

So the average mutual information is exactly the information gain!
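Identity (10), that average mutual information equals information gain, can be checked numerically on an arbitrary joint distribution; the numbers below are made up:

```python
import math

# Made-up joint distribution over classes X = {a, b} and feature values Y = {0, 1}
p_xy = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.4, ("b", 1): 0.1}

p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# I(X;Y) from the definition: sum p(x,y) log [ p(x|y) / p(x) ]
i_xy = sum(p * math.log2((p / p_y[y]) / p_x[x]) for (x, y), p in p_xy.items())

# Information gain: H(X) - H(X|Y)
h_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items())
ig = H(p_x) - h_x_given_y

print(abs(i_xy - ig) < 1e-12)  # True: the two quantities coincide
```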
Mutual information in feature selection

The mutual information between a term $w$ and a class $c_i$ is

$$MI(w,c_i) = \log\frac{p(w|c_i)}{p(w)} \qquad \cdots\cdots (11)$$

Here $p(w)$ is the fraction of all documents that contain $w$, and $p(w|c_i)$ is the fraction of the documents in class $c_i$ that contain $w$.

For the system as a whole, the mutual information of the term $w$ is

$$MI_{avg}(w) = \sum_i p(c_i)\log\frac{p(w|c_i)}{p(w)} \qquad \cdots\cdots (12)$$

Finally, the K terms with the largest mutual information are kept as the feature set.
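A sketch of per-class MI scoring over an invented toy corpus (the corpus format and terms are hypothetical, chosen only to exercise equation (11)):

```python
import math
from heapq import nlargest

# Tiny invented corpus: (class label, set of terms in the document)
docs = [
    ("sports", {"ball", "team"}),
    ("sports", {"ball", "win"}),
    ("politics", {"vote", "win"}),
    ("politics", {"vote", "law"}),
]
vocab = set().union(*(terms for _, terms in docs))

def mi(w, c):
    """MI(w, c) = log2[ p(w|c) / p(w) ], equation (11)."""
    p_w = sum(1 for _, terms in docs if w in terms) / len(docs)
    in_c = [terms for cls, terms in docs if cls == c]
    p_w_c = sum(1 for terms in in_c if w in terms) / len(in_c)
    # A term never seen in the class gives log 0; score it -inf
    return math.log2(p_w_c / p_w) if p_w_c > 0 else float("-inf")

# Keep the K terms with the highest MI for the "sports" class
K = 2
top = nlargest(K, vocab, key=lambda w: mi(w, "sports"))
print(sorted(top))  # ['ball', 'team']
```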
Cross Entropy

Cross entropy, relative entropy, KL distance (Kullback-Leibler Distance), and KL divergence (Kullback-Leibler Divergence) all refer here to the same concept.

The cross-entropy method is a general-purpose Monte-Carlo technique, commonly used for simulating rare events and for optimizing multi-extremal functions. It has been applied to classic problems such as the travelling salesman problem, the knapsack problem, the shortest-path problem, and the max-cut problem. A reference: A Tutorial on the Cross-Entropy Method
The derivation of the cross-entropy algorithm raises a side question: how do you compute a mathematical expectation? Common approaches include:

- probabilistic methods, e.g. crude Monte-Carlo
- change of measure
- change of variables in partial differential equations
- Green's function methods
- Fourier transform methods
In practice the probability distribution $h$ that a variable $x$ follows is usually unknown, so we use a distribution $g$ to approximate $h$ -- essentially a form of function estimation. One way to measure how close $g$ and $h$ are is the Kullback-Leibler distance, the expectation, taken in log space, of the ratio of $g$ to $h$.

When $x$ is a discrete variable, the KL distance is defined as

$$D(g,h) = \sum_x g(x)\log\frac{g(x)}{h(x)}$$

Usually $g$ is chosen from the same distribution family as $h$ (for example, if $h$ is known to be exponential, $g$ is taken to be exponential too) -- this is parameter estimation: the two pdfs differ only in their parameters (in practice the parameters of $h$ are simply unknown).

The formula shows that the KL distance is asymmetric, i.e. $D(g,h) \neq D(h,g)$, so a symmetric distance is sometimes used instead: $D(g,h) + D(h,g)$.
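A quick numerical illustration of the asymmetry (the two distributions are invented for the example; log base 2, as elsewhere in this post):

```python
import math

def kl(g, h):
    """D(g, h) = sum_x g(x) log2[ g(x) / h(x) ] for discrete distributions
    given as {outcome: prob}; assumes h(x) > 0 wherever g(x) > 0."""
    return sum(p * math.log2(p / h[x]) for x, p in g.items() if p > 0)

g = {"a": 0.5, "b": 0.5}
h = {"a": 0.9, "b": 0.1}

d_gh, d_hg = kl(g, h), kl(h, g)
print(d_gh != d_hg)   # True: KL distance is not symmetric
sym = d_gh + d_hg     # the symmetric variant D(g,h) + D(h,g)
```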
Feature selection based on expected cross entropy

$$CE(w) = p(w)\sum_i p(c_i|w)\log\frac{p(c_i|w)}{p(c_i)}$$

Here $p(c_i|w)$ is the probability that a document belongs to class $c_i$ given that the term $w$ occurs in it.

The cross entropy reflects the distance between the probability distribution of the text classes and that same distribution conditioned on the occurrence of a particular term: the larger a term's cross entropy, the stronger its influence on the class distribution. So the K terms with the largest CE are kept as the final feature set.

If the symmetric form of the cross entropy is used, the formula becomes

$$CE_{sym}(w) = p(w)\sum_i \left(p(c_i|w) - p(c_i)\right)\log\frac{p(c_i|w)}{p(c_i)}$$
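A minimal sketch of this scoring, assuming the common form CE(w) = p(w) * sum_i p(c_i|w) log[ p(c_i|w) / p(c_i) ] and an invented toy corpus:

```python
import math

# Tiny invented corpus: (class label, set of terms in the document)
docs = [
    ("sports", {"ball", "team"}),
    ("sports", {"ball", "win"}),
    ("politics", {"vote", "win"}),
    ("politics", {"vote", "law"}),
]
classes = {c for c, _ in docs}

def expected_cross_entropy(w):
    """CE(w) = p(w) * sum_i p(c_i|w) * log2[ p(c_i|w) / p(c_i) ]."""
    with_w = [c for c, terms in docs if w in terms]
    if not with_w:
        return 0.0
    p_w = len(with_w) / len(docs)
    score = 0.0
    for c in classes:
        p_c = sum(1 for cls, _ in docs if cls == c) / len(docs)
        p_c_w = with_w.count(c) / len(with_w)
        if p_c_w > 0:                    # 0 * log 0 -> 0 by convention
            score += p_c_w * math.log2(p_c_w / p_c)
    return p_w * score

# "ball" occurs in only one class, so it shifts the class distribution a lot;
# "win" occurs evenly across classes and scores 0
print(expected_cross_entropy("ball"))  # 0.5
print(expected_cross_entropy("win"))   # 0.0
```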