Computing the matching degree
The matching degree between a sentence and its label is computed by the following statement in the forward() method of the opennre.model.BagAttention class:
# (nsum): each element is the matching degree a_i from the "Noise Detector" section of [1]; every sentence has one matching degree
att_score = (rep * att_mat).sum(-1)
The normalized matching degree is computed by the following statement:
# (n) softmax-normalized matching degree
softmax_att_score = self.softmax(att_score[scope[i][0]:scope[i][1]])
Note that the normalization is performed within each bag of sentences, not across the whole batch.
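To make the per-bag normalization concrete, here is a minimal sketch. The toy values of att_score and scope are made up for illustration; only the slicing pattern `att_score[scope[i][0]:scope[i][1]]` comes from the statement above.

```python
import torch

# Hypothetical toy data: 5 sentences split into two bags.
att_score = torch.tensor([1.0, 2.0, 3.0, 0.5, 0.5])  # (nsum)
scope = [(0, 3), (3, 5)]  # [start, end) sentence indices of each bag

softmax = torch.nn.Softmax(dim=-1)
bag_weights = []
for start, end in scope:
    # Normalize only over the sentences that belong to this bag.
    bag_weights.append(softmax(att_score[start:end]))  # (n)

# Each bag's weights sum to 1 independently of the other bags.
bag_sums = [w.sum().item() for w in bag_weights]
print(bag_sums)
```

Because each slice is normalized separately, a sentence's softmax score is only comparable to the other sentences in the same bag.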
To compute the Z-score of every matching degree within a bag, add the following code:
z_att_score = (this_att_score - this_att_score.mean()) / this_att_score.std()
Samples can then be filtered by the Z-score of their matching degree.
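One caveat with the Z-score line above: for a bag that contains a single sentence, std() returns NaN, so the comparison against any threshold silently fails. The sketch below is one possible guard, not the code used in the project; the helper name bag_z_scores and the eps term are assumptions added for illustration.

```python
import torch

def bag_z_scores(bag_att_score: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Z-scores of the matching degrees inside one bag (hypothetical helper)."""
    if bag_att_score.numel() < 2:
        # std() of a single element is NaN, so treat a one-sentence bag as neutral.
        return torch.zeros_like(bag_att_score)
    std = bag_att_score.std()
    # eps avoids division by zero when all scores in the bag are identical.
    return (bag_att_score - bag_att_score.mean()) / (std + eps)

z = bag_z_scores(torch.tensor([1.0, 2.0, 3.0]))
single = bag_z_scores(torch.tensor([4.2]))
print(z, single)
```

Returning zeros for a one-sentence bag means that sentence is classified as neither valid nor noisy under a symmetric threshold rule; whether that is the desired behavior is a design choice.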
The query variable?
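Figure 1 of [1] mentions a Query. In the program, a query variable is likewise used when fetching label representations, to find the label representation that corresponds to each sentence.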
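Reading the OpenNRE code shows that the query variable stores the id of each sample's label (relation). It is used to fetch the label representation for that label id from self.fc.weight.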
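As an illustration of this lookup, here is a minimal standalone sketch. The sizes H and num_class, the random rep, and the query values are made up; only the indexing of fc.weight by query and the formula `(rep * att_mat).sum(-1)` follow the description above.

```python
import torch
import torch.nn as nn

H, num_class = 4, 3            # hypothetical sizes
fc = nn.Linear(H, num_class)   # fc.weight has shape (num_class, H)

rep = torch.randn(5, H)                 # (nsum, H) sentence representations
query = torch.tensor([0, 2, 2, 1, 0])   # (nsum) label id of each sentence

# Row i of att_mat is the label representation for sentence i's label id.
att_mat = fc.weight[query]              # (nsum, H)

# Dot product of each sentence with its own label representation.
att_score = (rep * att_mat).sum(-1)     # (nsum)
print(att_score.shape)
```

Every id in query must lie in [0, num_class - 1]; an id outside that range fails at this indexing step.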
The "Label Generator" section of [1] states:
The label generator takes the pre-trained relation matrix $\mathbf{L}$ as one of its inputs.
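Ideas:
After clustering, save the result to a file, then have the relation extraction program read that file into memory and look up the cluster labels from it?
🌟 After the noisy samples are selected, call the test function of the trained clustering model with the noisy samples as input to obtain machine-guessed labels, then replace the original labels with them.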
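Analysis of the matching degree distribution
A first set of sentence-label matching degree values and their softmax-normalized counterparts has been saved to the files "att_score.log" and "softmax_att_score.logs".
The next step is to analyze the distribution of the raw matching degrees and use it to draft concrete rules for filtering noisy samples.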
Noisy sample detection rules
Valid sample:
matching degree Z-score > 0.5
Noisy sample:
matching degree Z-score < -0.5
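The two rules can be expressed as boolean masks over the Z-scores of one bag. This is a minimal sketch with made-up Z-scores; the constant and variable names are illustrative and not taken from the project code.

```python
import torch

VALID_THRESHOLD = 0.5    # Z-score above this: valid sample
NOISE_THRESHOLD = -0.5   # Z-score below this: noisy sample

# Made-up Z-scores for a bag of five sentences.
z_att_score = torch.tensor([1.2, 0.1, -0.7, -0.3, 0.6])

valid_mask = z_att_score > VALID_THRESHOLD
noise_mask = z_att_score < NOISE_THRESHOLD
# Samples between the two thresholds fall under neither rule.
undecided_mask = ~(valid_mask | noise_mask)

# Indices of the noisy sentences inside the bag, kept as a 1-D tensor.
noise_index = torch.nonzero(noise_mask).squeeze(1)
print(noise_index.tolist())
```

Note that the rules as written leave samples with a Z-score in [-0.5, 0.5] unclassified, so the code needs an explicit decision on how to treat them.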
Debug
RuntimeError 1
Exception raised: RuntimeError
The size of tensor a (690) must match the size of tensor b (47) at non-singleton dimension 0
At the time of the error:
batch.unsqueeze(1).shape = torch.Size([690, 1])
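This message is PyTorch's broadcasting error: two operands disagree on a dimension and neither size is 1. The notes do not record which tensor had size 47, so the sketch below only reproduces the error class with assumed shapes.

```python
import torch

a = torch.zeros(690, 1)  # same shape as batch.unsqueeze(1) in the notes
b = torch.zeros(47, 1)   # assumed shape of the other operand

try:
    a * b  # dimension 0 differs (690 vs 47) and neither is 1
    message = ""
except RuntimeError as err:
    message = str(err)
print(message)

# Broadcasting succeeds when the sizes are equal or one of them is 1.
ok_shape = (a * torch.zeros(1, 1)).shape
print(ok_shape)
```

Printing the shapes of both operands just before the failing line is usually the quickest way to see which one deviates from the expected size.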
RuntimeError 2
Exception raised: RuntimeError
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
File "/mnt/d/SourceCode/myNRE-Noise-detect/opennre/model/bag_denoise.py", line 146, in forward
bag_mat = rep[scope[i][0]:scope[i][1]] # (n, H) bag representation matrix: number of sentences * H
File "/mnt/d/SourceCode/myNRE-Noise-detect/opennre/framework/bag_re.py", line 141, in train_model
logits, noise_loss = self.model(label, scope, *args, bag_size=self.bag_size, dec_model = dec_model) # (B, C) output of the last fully connected layer
File "/mnt/d/SourceCode/myNRE-Noise-detect/example/noise_detector.py", line 230, in main
framework.train_model(args.metric, dec_model, tensorboardConfig=visualConfig)
File "/mnt/d/SourceCode/myNRE-Noise-detect/example/noise_detector.py", line 268, in <module>
main()
Possible cause: an index out of bounds
Error location 1: file "opennre/model/bag_denoise.py"
# Compute the probability that a noisy sample is classified as $y_{j}$
noise_y_j_logist = instance_logit[non_zero_index, new_labels] # (n')
Error location 2:
Traceback (most recent call last):
File "example/noise_detector.py", line 270, in <module>
main()
File "example/noise_detector.py", line 232, in main
framework.train_model(args.metric, dec_model, tensorboardConfig=visualConfig)
File "/home/mist/myNRE-Noise-detect/opennre/framework/bag_re.py", line 141, in train_model
logits, noise_loss = self.model(label, scope, *args, bag_size=self.bag_size, dec_model = dec_model) # (B, C) output of the last fully connected layer
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/mist/myNRE-Noise-detect/opennre/model/bag_denoise.py", line 149, in forward
bag_mat = rep[scope[i][0]:scope[i][1]] # (n, H) bag representation matrix: number of sentences * H
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Variable values:
z_att_score.shape = torch.Size([7])
noise_mat.shape = torch.Size([1])
new_labels.shape = torch.Size([1])
instance_logit.shape = torch.Size([7, 58])
non_zero_index.shape = torch.Size([1])
This suggests that instance_logit[non_zero_index, new_labels] caused the out-of-bounds access: either non_zero_index > bag_size - 1, or new_labels > num_class - 1.
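Since CUDA reports the assert asynchronously, the stack trace may point at an unrelated line. Besides setting CUDA_LAUNCH_BLOCKING=1 as the log suggests, the same indexing can be replayed on CPU, where an out-of-range index raises an ordinary IndexError at the offending line. The sketch below uses the shapes from the log; the out-of-range label value 58 is hypothetical.

```python
import torch

instance_logit = torch.zeros(7, 58)   # same shape as in the log
non_zero_index = torch.tensor([0])    # row index, within range
new_labels = torch.tensor([58])       # hypothetical id, valid ids are 0..57

try:
    instance_logit[non_zero_index, new_labels]
    error = None
except IndexError as err:
    # On CPU the failure is synchronous and names the dimension.
    error = err
print(type(error).__name__, error)
```

On the GPU the same access would surface only as "device-side assert triggered", possibly at a later line such as the bag_mat slice in the trace above.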
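First, check the variable non_zero_index, which stores the indices of the nonzero values in new_labels.
Hypothesis: if every noisy sample is classified as NA, non_zero_index is empty. Does calling instance_logit[non_zero_index, new_labels] raise an error in that case? Verification: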
import torch
# Initialize new_labels as a 1-D torch.Tensor
new_labels = torch.Tensor([0])
print(new_labels)
# (nsum, num_class)
instance_logit = torch.Tensor([
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
])
print(instance_logit)
# Find the indices of the nonzero values in new_labels
non_zero_index = torch.nonzero(new_labels).squeeze(1) # samples clustered as "NA" must be excluded
print(non_zero_index)
print(instance_logit[non_zero_index, 2])
Output:
tensor([0.])
tensor([[0.1000, 0.2000, 0.3000, 0.4000, 0.5000, 0.6000],
[0.1000, 0.2000, 0.3000, 0.4000, 0.5000, 0.6000],
[0.1000, 0.2000, 0.3000, 0.4000, 0.5000, 0.6000]])
tensor([], dtype=torch.int64)
tensor([])
No error is raised, so an empty index is not the cause.
Add the following code to check whether either index is out of range:
if torch.nonzero(non_zero_index > (instance_logit.size(0) - 1)).size(0) != 0:
    print("OUT OF RANGE!")
    print(non_zero_index)
    print(instance_logit.size(0))
if torch.nonzero(new_labels > (instance_logit.size(1) - 1)).size(0) != 0:
    print("OUT OF RANGE!")
    print(new_labels)
    print(instance_logit.size(1))
Trace how the value of non_zero_index is produced:
# Find the indices of the nonzero values in new_labels
non_zero_index = torch.nonzero(new_labels).squeeze(1) # samples clustered as "NA" must be excluded
Here new_labels stores the cluster labels of the noisy samples, and at this point the elements with value "0" have not yet been excluded.
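Root cause: the values in new_labels exceeded the range of instance_logit.size(1). The representation fed to the clustering model should have shape $(n, N)$, but at the time of the error the input had shape $(H)$. This happens because, when there is only one noisy sample, the statement that extracts that sample is flawed: it returns a 1-D vector instead of a 2-D matrix.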
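The notes do not show the faulty statement, so the following is only a sketch of one common way this dimension collapse happens (a squeeze() without a dim argument) and of two ways to keep the sample axis. All names and sizes here are hypothetical.

```python
import torch

H = 4
bag_mat = torch.randn(7, H)        # (n, H) bag representation matrix
noise_index = torch.tensor([3])    # exactly one noisy sample in the bag

# squeeze() without a dim argument removes every size-1 axis,
# so the single-sample case loses its sample axis.
collapsed = bag_mat[noise_index].squeeze()   # (H)

# Indexing with a 1-D index tensor keeps the sample axis.
kept = bag_mat[noise_index]                  # (1, H)

# If a 1-D vector has already been produced, restore the axis explicitly.
restored = collapsed.reshape(-1, H)          # (1, H)

print(collapsed.shape, kept.shape, restored.shape)
```

A shape assertion such as `assert noise_mat.dim() == 2` right before the clustering model is called would turn this case into an immediate, readable failure.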
References
[1] Shang Y, Huang H Y, Mao X L, et al. Are noisy sentences useless for distant supervised relation extraction?[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 34(05): 8799-8806.