I recently read the paper Surface Form Competition: Why the Highest Probability Answer Isn't Always Right, and while going through its code I noticed that the paper computes pointwise mutual information via cross-entropy. After looking around I still didn't fully understand why, so I'd like to work through it here.
The paper proposes scoring candidate answers with domain conditional pointwise mutual information, PMI_DC, to mitigate surface form competition. It is computed as:
PMI_DC(x, y) = P(y | x) / P(y | x_domain)
This is ordinary pointwise mutual information, PMI(x, y) = P(y | x) / P(y), with the unconditional P(y) replaced by the domain-conditional P(y | x_domain). (For background on pointwise mutual information, see the CSDN post 信息熵、KL散度、交叉熵、互信息、点互信息.)
By conditioning on the domain string ("because" in the example below), PMI_DC makes the task scope explicit.
For example:
Premise (x): The bar closed because
Domain premise (x_domain): because
Hypothesis 1 (y1): it was crowded.
Hypothesis 2 (y2): it was 3am.
Take y1 as an example.
For a plain LM, we score with P(y1 | x), i.e. P(it was crowded | The bar closed because).
For PMI_DC, we score with P(y1 | x) / P(y1 | x_domain), where P(y1 | x_domain) is P(it was crowded | because).
The same goes for y2. In the end, what we actually care about is the index i of the hypothesis yi that maximizes this score.
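To make the argmax concrete, here is a toy sketch of the scoring rule; all probabilities below are invented for illustration, not real LM outputs:

```python
# score each hypothesis by P(y_i | x) / P(y_i | x_domain), then take the argmax
# (all probabilities are made up for illustration)
p_cond   = {"it was crowded": 0.020, "it was 3am": 0.008}  # P(y_i | "The bar closed because")
p_domain = {"it was crowded": 0.050, "it was 3am": 0.004}  # P(y_i | "because")

scores = {y: p_cond[y] / p_domain[y] for y in p_cond}
best = max(scores, key=scores.get)
# raw LM probability favors "it was crowded" (0.020 > 0.008),
# but PMI_DC favors "it was 3am" (0.008/0.004 = 2.0 beats 0.020/0.050 = 0.4)
```

With these made-up numbers the two criteria disagree, which is exactly the surface-form-competition effect the paper targets: "it was crowded" is generically likely after "because", so dividing by P(y | x_domain) discounts it.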
The code for computing PMI_DC and PMI is shown below.
My understanding:
By definition, PMI_DC = P(y|x) / P(y|x_domain),
and the cross-entropy of y given x is H(y|x) = -log P(y|x),
so PMI_DC can be computed via cross-entropies: log PMI_DC = log P(y|x) - log P(y|x_domain) = H(y|domain) - H(y|x). (The ratio becomes a difference only after taking the log, but since log is monotonic, ranking hypotheses by H(y|domain) - H(y|x) is equivalent to ranking them by PMI_DC itself.)
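A quick numeric check of this identity, with made-up probabilities:

```python
import math

# hypothetical probabilities, just to check the algebra
p_y_given_x = 0.02       # P(y | x)
p_y_given_domain = 0.05  # P(y | x_domain)

log_pmi_dc = math.log(p_y_given_x / p_y_given_domain)
# H(y|domain) - H(y|x) = (-log P(y|domain)) - (-log P(y|x))
ce_diff = -math.log(p_y_given_domain) - (-math.log(p_y_given_x))
# the two agree up to floating-point error
```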
# here H(y|domain) actually means H(model(domain), y)
# log PMI_DC = H(y|domain) - H(y|x)
dcpmi = [ce_0 - ce_1 for ce_0,ce_1 in zip(domain_cond_ce, cond_ce)]
# log PMI(x,y) = H(y) - H(y|x); H(y): cross-entropy of y on its own (uncond_ce); H(y|x): cross-entropy of y given x (cond_ce)
pmi = [ce_0 - ce_1 for ce_0,ce_1 in zip(uncond_ce, cond_ce)]
# computes H(logits, targets), i.e. -log P(targets | inputs) = -log P(y|x)
# (abridged excerpt: the construction of input_ids, labels, n_seqs and max_len
#  from inputs/targets is omitted here)
def cross_entropy_list(inputs, targets):
    # get logits from the model
    with torch.no_grad():
        input_ids = input_ids.to(device)
        logits = model(input_ids).logits.cpu()[:,:-1].contiguous()
    # get cross-entropies given the logits
    logit_shape = logits.shape
    logits = logits.view(-1, logit_shape[-1])
    # cross-entropy between the predicted logits and the true next tokens
    ce_list = F.cross_entropy(logits, labels[:,1:].contiguous().view(-1), reduction='none')
    # sum the per-token losses to get one total per sequence
    ce_list = ce_list.view(n_seqs, max_len -1).sum(dim=1).squeeze().tolist()
    return ce_list
Through cross_entropy_list() we obtain H(y|x), H(y|domain), and H(y). Note: I had long assumed that x and (x, domain) were two different things, but the code shows that x is already the domain-scoped premise, i.e. x and (x, domain) are the same input. H(y) is computed by feeding the constant sequence [25] as the input.
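The reason summing per-token cross-entropies gives -log P(y|x) is the chain rule of probability: the sequence probability is the product of per-step token probabilities, so its negative log is the sum of per-token negative logs. A minimal pure-Python sketch with an invented three-step distribution:

```python
import math

# invented next-token distributions at each step (not from a real LM)
step_probs = [
    {"it": 0.5, "was": 0.3, "crowded": 0.2},
    {"it": 0.1, "was": 0.7, "crowded": 0.2},
    {"it": 0.2, "was": 0.2, "crowded": 0.6},
]
target = ["it", "was", "crowded"]

# per-token cross-entropy is -log P(token); the sequence CE is their sum
ce = sum(-math.log(p[t]) for p, t in zip(step_probs, target))
# exp(-ce) recovers the product of the per-step probabilities, i.e. P(y|x)
seq_prob = math.exp(-ce)
```

This is exactly what the `.sum(dim=1)` over the per-token `F.cross_entropy` values does in the function above.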
## get conditional CEs: -log P(y|x), e.g. P(the woman got her hair cut | The man perceived that the woman looked different because)
# -log P(y|x) = H(x, y), where H(x, y) actually means H(model(x), y)
# to compute -log P(y|x): run x through the model to get logits = model(inputs), then take the cross-entropy of the predictions against the targets, cross_entropy(logits, targets)
cond_ce = cross_entropy_list([opt['premise'] for opt in options],
[opt['hypothesis'] for opt in options],
model, cache=cache, batch=batch, calculate = calculate)
#{'premise': ' The man perceived that the woman looked different because', 'hypothesis': ' the woman got her hair cut.',
## get domain conditional CEs: -log P(y|domain), e.g. P(the woman got her hair cut | because)
domain_cond_ce = cross_entropy_list([opt['uncond_premise'] for opt in options],
[opt['uncond_hypothesis'] for opt in options],
model, cache=cache, batch=batch, calculate = calculate)
# 'uncond_premise': ' because', 'uncond_hypothesis': ' the woman got her hair cut.'
## get unconditional CEs: -log P(y)
# cross-entropy between the constant input [25] and uncond_hypothesis, i.e. P(the woman got her hair cut | [25])
uncond_ce = cross_entropy_list([[25] for opt in options],
[opt['uncond_hypothesis'] for opt in options],
model, cache=cache, batch=batch, calculate = calculate)
Computing PMI_DC and PMI from the cross-entropies:
# log PMI_DC = H(y|domain) - H(y|x)
dcpmi = [ce_0 - ce_1 for ce_0,ce_1 in zip(domain_cond_ce, cond_ce)]
# log PMI(x,y) = H(y) - H(y|x); H(y): cross-entropy of y on its own (uncond_ce); H(y|x): cross-entropy of y given x (cond_ce)
pmi = [ce_0 - ce_1 for ce_0,ce_1 in zip(uncond_ce, cond_ce)]
Finally, the predictions:
# pick the index with the smallest conditional cross-entropy
lm_pred = cond_ce.index(min(cond_ce))
lm_avg_pred = avg_cond_ce.index(min(avg_cond_ce))
lm_domain_cond_pred = domain_cond_ce.index(min(domain_cond_ce))
# pick the index with the largest (domain) pointwise mutual information
dcpmi_pred = dcpmi.index(max(dcpmi))
pmi_pred = pmi.index(max(pmi))
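Putting it all together with toy numbers (the cross-entropies below are invented for two hypotheses, not model outputs) shows how the raw-LM prediction and the PMI_DC prediction can disagree:

```python
# invented cross-entropies for two hypotheses
cond_ce        = [3.9, 4.8]  # H(y_i | x)      = -log P(y_i | x)
domain_cond_ce = [3.0, 5.5]  # H(y_i | domain) = -log P(y_i | x_domain)

dcpmi = [ce_0 - ce_1 for ce_0, ce_1 in zip(domain_cond_ce, cond_ce)]  # ~[-0.9, 0.7]
lm_pred = cond_ce.index(min(cond_ce))    # hypothesis 0 has the lowest conditional CE
dcpmi_pred = dcpmi.index(max(dcpmi))     # hypothesis 1 has the largest log PMI_DC
```

Hypothesis 0 is more probable under the LM, but hypothesis 1 gains more probability from conditioning on the full premise than on the bare domain string, so PMI_DC prefers it.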