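While doing sequence labeling, the CRF layer raised this error:
'mask of the first timestep must all be on'
Searching for this error mostly turns up the advice to set the batch_first argument to True. In my code it was already True, so that was not the problem.
The error means the first value of the mask is wrong: for every sequence in the batch, the mask at the first timestep must be on and cannot be 0. I stepped through with the debugger to find exactly where it fails: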
ipdb> n
> <ipython-input-60-103a41bbe73a>(46)crf_neg_log_likelihood()
44 mask = mask.type(torch.uint8)
45
---> 46 crf_llh = self.crf(logits, tags, mask, reduction='mean') # Compute the conditional log likelihood of a sequence of tags given emission scores
47 # crf_llh = self.crf(logits, tags, mask) # Compute the conditional log likelihood of a sequence of tags given emission scores
48 return -crf_llh
ipdb> n
ValueError: mask of the first timestep must all be on
> <ipython-input-60-103a41bbe73a>(46)crf_neg_log_likelihood()
44 mask = mask.type(torch.uint8)
45
---> 46 crf_llh = self.crf(logits, tags, mask, reduction='mean') # Compute the conditional log likelihood of a sequence of tags given emission scores
47 # crf_llh = self.crf(logits, tags, mask) # Compute the conditional log likelihood of a sequence of tags given emission scores
48 return -crf_llh
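This shows that the first position of the mask passed in here has the wrong value. Printing the inputs makes this visible:
[Screenshot: the printed input id tensor and the corresponding mask]
In the mask, False marks padding positions. In the first tensor, 0 stands for padding and every other number is the id that a word was converted to. But the first word in the vocabulary clearly also got id 0, which collides with the 0 used for padding. Its mask value therefore came out as False at the first timestep, and that caused the error.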
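To make the collision concrete, here is a minimal sketch in plain Python. The toy sentences are hypothetical, and it assumes the mask is derived by comparing each id against the padding id 0, which matches the symptom described above but is not shown in the original code.

```python
# Hypothetical toy data; the sentences from the original post are not shown.
sentences = [["the", "cat", "sat"], ["the", "dog"]]

# Buggy vocabulary: ids start at 0, so the first word seen gets id 0.
word2id = {}
for sentence in sentences:
    for word in sentence:
        if word not in word2id:
            word2id[word] = len(word2id)

PAD_ID = 0  # padding also uses 0, which collides with the first word
max_len = max(len(s) for s in sentences)
batch_ids = [
    [word2id[w] for w in s] + [PAD_ID] * (max_len - len(s))
    for s in sentences
]

# Assumed mask rule: a position counts as real iff its id differs from PAD_ID.
mask = [[tok != PAD_ID for tok in row] for row in batch_ids]

# The CRF requires the first timestep of every sequence to be on.
first_timestep_on = all(row[0] for row in mask)

print(batch_ids)
print(mask)
print(first_timestep_on)
```

Both sentences start with the word that was assigned id 0, so the first mask entry of every row is False, which is exactly the condition the CRF rejects.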
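After the following change, the error went away:
# Reserve id 0 for padding and id 1 for unknown words,
# so that no real word can ever receive the padding id.
word2id = {'<pad>': 0, '<unk>': 1}
for sentence in sentences:  # build the word-to-index mapping
    for word in sentence:
        if word not in word2id:
            word2id[word] = len(word2id)
print(word2id)
print(len(word2id))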
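As a complementary safeguard, the mask can be built from the sentence lengths rather than from the id values, so that it stays correct regardless of how ids are assigned. The sketch below is plain Python under stated assumptions: `build_vocab` and `pad_batch` are hypothetical helper names, and the toy sentences are made up for illustration.

```python
def build_vocab(sentences, pad_token="<pad>", unk_token="<unk>"):
    """Map words to ids, reserving 0 for padding and 1 for unknown words."""
    word2id = {pad_token: 0, unk_token: 1}
    for sentence in sentences:
        for word in sentence:
            if word not in word2id:
                word2id[word] = len(word2id)
    return word2id


def pad_batch(sentences, word2id, pad_token="<pad>", unk_token="<unk>"):
    """Pad every sentence to the batch maximum; the mask comes from lengths."""
    pad_id, unk_id = word2id[pad_token], word2id[unk_token]
    max_len = max(len(s) for s in sentences)
    ids, mask = [], []
    for s in sentences:
        n_pad = max_len - len(s)
        ids.append([word2id.get(w, unk_id) for w in s] + [pad_id] * n_pad)
        mask.append([True] * len(s) + [False] * n_pad)
    return ids, mask


# Hypothetical toy data for illustration.
sentences = [["the", "cat", "sat"], ["the", "dog"]]
word2id = build_vocab(sentences)
ids, mask = pad_batch(sentences, word2id)

# Cheap sanity check before handing the batch to the CRF.
assert all(row[0] for row in mask), "mask of the first timestep must all be on"
print(ids)
print(mask)
```

With tensors and batch_first=True, the equivalent check is `mask[:, 0].all()`. An empty sentence in the batch would still fail this check, so such sentences need to be filtered out beforehand.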