Implementing DCRE Noise Detection

Nestaldini

Last modified 2022-02-28 02:59:38

Tags: python, machine learning, natural language processing

First published 2022-02-28 02:45:15

Article link: https://blog.csdn.net/Nestaldini/article/details/123173830

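Computing the matching score

The matching score between a sentence and its label is computed in the forward() function of the opennre.model.BagAttention class by the following statement:

# (nsum): each element is the matching score a_i from the "Noise Detector" section of [1]; every sentence has one
att_score = (rep * att_mat).sum(-1)

The normalized matching score is computed by the following statement:

# (n) softmax-normalized matching scores
softmax_att_score = self.softmax(att_score[scope[i][0]:scope[i][1]])

Note that the normalization is applied within each bag, not across the whole batch.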
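
To make the per-bag normalization concrete, here is a minimal sketch, assuming `scope` holds one (start, end) index pair per bag as in the statement above. The helper name `bag_softmax` and the tensor values are invented for illustration and are not part of OpenNRE.

```python
import torch

def bag_softmax(att_score, scope):
    """Softmax-normalize matching scores separately inside each bag.

    att_score: (nsum,) raw matching scores of all sentences in the batch.
    scope: list of (start, end) pairs; bag i owns att_score[start:end].
    """
    normalized = []
    for start, end in scope:
        normalized.append(torch.softmax(att_score[start:end], dim=-1))
    return normalized

# Hypothetical batch: 5 sentences split into two bags of sizes 3 and 2.
att_score = torch.tensor([1.0, 2.0, 3.0, 0.5, 0.5])
scope = [(0, 3), (3, 5)]
per_bag = bag_softmax(att_score, scope)
for weights in per_bag:
    print(weights, weights.sum())
```

Each bag's weights sum to 1 on their own, so a sentence's normalized score only says how it compares with the other sentences in the same bag.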

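To compute the Z-score of every matching score within a bag, add the following code:

z_att_score = (this_att_score - this_att_score.mean()) / this_att_score.std()

Samples can then be filtered by the Z-score of their matching score.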
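
One thing to watch with the Z-score statement: `std()` is undefined for a bag with a single sentence and is zero when all scores in a bag are equal, so the division yields NaN in both cases. The sketch below guards against that. Returning zeros in those cases is my own assumption about a sensible fallback, and `bag_z_score` and `eps` are hypothetical names, not code from the project.

```python
import torch

def bag_z_score(this_att_score, eps=1e-8):
    """Z-score of the matching scores within one bag.

    Returns zeros when the bag has fewer than two sentences or when the
    scores have (near-)zero spread, since the Z-score is undefined there.
    """
    if this_att_score.numel() < 2:
        return torch.zeros_like(this_att_score)
    std = this_att_score.std()
    if std < eps:
        return torch.zeros_like(this_att_score)
    return (this_att_score - this_att_score.mean()) / std

print(bag_z_score(torch.tensor([1.0, 2.0, 3.0])))
print(bag_z_score(torch.tensor([0.7])))  # single-sentence bag, no NaN
```

With a zero Z-score, such sentences fall between the two thresholds and are treated as neither clearly valid nor clearly noisy.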

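The `query` variable?

Figure 1 of [1] mentions a Query. In the code, a variable named query is likewise used when fetching label representations, to look up the label representation that corresponds to each sentence.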

Reading the OpenNRE code shows that query stores the id of each sample's label (relation). It is used to fetch the label representation for that id from self.fc.weight.
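
The lookup described above can be sketched as follows. This reflects my reading of how query is built and used. The layer sizes, the random sentence representations and the labels are all hypothetical, and `fc` here is a stand-in for the model's final linear layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size, num_class = 4, 3            # hypothetical sizes
fc = nn.Linear(hidden_size, num_class)   # fc.weight has shape (num_class, H)

# Hypothetical batch: two bags with 2 and 1 sentences, labelled 2 and 0.
rep = torch.randn(3, hidden_size)        # (nsum, H) sentence representations
scope = [(0, 2), (2, 3)]
label = torch.tensor([2, 0])

# Give every sentence the relation id of the bag it belongs to.
query = torch.zeros(rep.size(0), dtype=torch.long)
for i, (start, end) in enumerate(scope):
    query[start:end] = label[i]

att_mat = fc.weight[query]               # (nsum, H) label representation per sentence
att_score = (rep * att_mat).sum(-1)      # (nsum) matching score per sentence
print(query, att_score.shape)
```

Row k of `fc.weight` acts as the representation of relation k, so indexing it with query gives one label vector per sentence.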

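The "Label Generator" section of [1] states:

The label generator takes the pre-trained relation matrix $\mathbf{L}$ as one of its inputs.

Ideas:

After clustering, save the results to a file, then have the relation extraction program load that file into memory and look up the cluster labels from it?

🌟After selecting the noisy samples, call the test function of the trained clustering model with the noisy samples as input to obtain machine-guessed labels, then replace the original labels with them.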

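Analyzing the distribution of matching scores

A first batch of sentence-label matching scores and softmax-normalized matching scores has been saved to the files "att_score.log" and "softmax_att_score.logs".

Next step: analyze the distribution of the raw matching scores and use it to draft concrete rules for filtering noisy samples.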
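
As a starting point for that analysis, here is a minimal sketch that summarizes a 1-D tensor of scores. Loading the log files is left out because their format is not described here. The helper `summarize_scores` and the sample values are hypothetical.

```python
import torch

def summarize_scores(scores):
    """Return basic distribution statistics for a 1-D float tensor of scores."""
    q = torch.quantile(scores, torch.tensor([0.25, 0.5, 0.75]))
    return {
        "mean": scores.mean().item(),
        "std": scores.std().item(),
        "min": scores.min().item(),
        "q25": q[0].item(),
        "median": q[1].item(),
        "q75": q[2].item(),
        "max": scores.max().item(),
    }

# Hypothetical raw matching scores, standing in for the saved values.
scores = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
stats = summarize_scores(scores)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Comparing the quartiles with the mean and standard deviation gives a quick sense of whether the scores are skewed, which matters when choosing symmetric Z-score thresholds.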

Noise sample detection rules

Valid samples:

Z-score of the matching score > 0.5

Noisy samples:

Z-score of the matching score < -0.5
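
The two rules can be expressed as boolean masks over the Z-scores of one bag. This is a minimal sketch: the constant and function names are hypothetical and the Z-scores are made-up values.

```python
import torch

VALID_Z = 0.5    # valid if Z-score > 0.5
NOISE_Z = -0.5   # noisy if Z-score < -0.5

def split_by_z(z_att_score):
    """Return (valid_mask, noise_mask) for the Z-scores of one bag."""
    valid_mask = z_att_score > VALID_Z
    noise_mask = z_att_score < NOISE_Z
    return valid_mask, noise_mask

# Hypothetical Z-scores of a five-sentence bag.
z = torch.tensor([-1.2, -0.5, 0.0, 0.5, 0.9])
valid_mask, noise_mask = split_by_z(z)
print(valid_mask)
print(noise_mask)
```

Sentences whose Z-score lies in [-0.5, 0.5] match neither rule, so the rules as written leave them unclassified.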

Debug

RuntimeError 1

Exception raised: RuntimeError
The size of tensor a (690) must match the size of tensor b (47) at non-singleton dimension 0

At the time of the error:

batch.unsqueeze(1).shape = torch.Size([690, 1])
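
This message is what PyTorch reports when two operands cannot be broadcast. The sketch below reproduces it with a (690, 1) tensor. The second operand is not shown in the notes above, so its shape (47, 8) is purely an assumption chosen to trigger the same message.

```python
import torch

batch = torch.zeros(690)      # stand-in for the real tensor
other = torch.zeros(47, 8)    # hypothetical second operand, not from the original code

try:
    # (690, 1) against (47, 8): dimension 0 is 690 vs 47 and neither is 1,
    # so broadcasting fails.
    batch.unsqueeze(1) * other
    error_message = None
except RuntimeError as err:
    error_message = str(err)

print(error_message)
```

Broadcasting only stretches dimensions of size 1, so the fix in such cases is to make dimension 0 of both operands refer to the same set of items.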

RuntimeError 2

Exception raised: RuntimeError
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
  File "/mnt/d/SourceCode/myNRE-Noise-detect/opennre/model/bag_denoise.py", line 146, in forward
    bag_mat = rep[scope[i][0]:scope[i][1]]                                  # (n, H) bag representation matrix: number of sentences * H
  File "/mnt/d/SourceCode/myNRE-Noise-detect/opennre/framework/bag_re.py", line 141, in train_model
    logits, noise_loss = self.model(label, scope, *args, bag_size=self.bag_size, dec_model = dec_model)    # (B, C) output of the last fully connected layer
  File "/mnt/d/SourceCode/myNRE-Noise-detect/example/noise_detector.py", line 230, in main
    framework.train_model(args.metric, dec_model, tensorboardConfig=visualConfig)
  File "/mnt/d/SourceCode/myNRE-Noise-detect/example/noise_detector.py", line 268, in <module>
    main()

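Possible cause: an index out of bounds.

Error location 1: file "opennre/model/bag_denoise.py"

# Compute the probability that each noisy sample is classified as $y_{j}$
noise_y_j_logist = instance_logit[non_zero_index, new_labels]   # (n')

Error location 2: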

Traceback (most recent call last):
  File "example/noise_detector.py", line 270, in <module>
    main()
  File "example/noise_detector.py", line 232, in main
    framework.train_model(args.metric, dec_model, tensorboardConfig=visualConfig)
  File "/home/mist/myNRE-Noise-detect/opennre/framework/bag_re.py", line 141, in train_model
    logits, noise_loss = self.model(label, scope, *args, bag_size=self.bag_size, dec_model = dec_model)    # (B, C) output of the last fully connected layer
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mist/myNRE-Noise-detect/opennre/model/bag_denoise.py", line 149, in forward
    bag_mat = rep[scope[i][0]:scope[i][1]]                                  # (n, H) bag representation matrix: number of sentences * H
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Variable values:

z_att_score.shape = torch.Size([7])

noise_mat.shape = torch.Size([1])
new_labels.shape = torch.Size([1])
instance_logit.shape = torch.Size([7, 58])
non_zero_index.shape = torch.Size([1])

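This suggests that instance_logit[non_zero_index, new_labels] triggered the out-of-bounds access: either non_zero_index > bag_size - 1, or new_labels > num_class - 1.

First, check non_zero_index, which stores the indices of the non-zero values in new_labels.

Hypothesis: if every noisy sample is classified as NA, non_zero_index is empty. Does instance_logit[non_zero_index, new_labels] raise an error in that case? A quick check: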

import torch

# Initialize new_labels as a 1-D torch.Tensor
new_labels = torch.Tensor([0])

print(new_labels)

# (nsum, num_class)
instance_logit = torch.Tensor([
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
])

print(instance_logit)

# Find the indices of the non-zero values in new_labels
non_zero_index = torch.nonzero(new_labels).squeeze(1)   # samples clustered as "NA" must be excluded

print(non_zero_index)

print(instance_logit[non_zero_index, 2])

Output:

tensor([0.])
tensor([[0.1000, 0.2000, 0.3000, 0.4000, 0.5000, 0.6000],
        [0.1000, 0.2000, 0.3000, 0.4000, 0.5000, 0.6000],
        [0.1000, 0.2000, 0.3000, 0.4000, 0.5000, 0.6000]])
tensor([], dtype=torch.int64)
tensor([])

No error is raised.
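
The check above indexes the column with the constant 2 rather than with new_labels. As a complementary sketch, the same experiment can be run with the exact expression from the model, assuming new_labels is an integer tensor (a float tensor cannot be used as an index). The values are made up.

```python
import torch

# (nsum, num_class) made-up logits
instance_logit = torch.arange(18, dtype=torch.float32).reshape(3, 6)

# The only noisy sample was clustered as "NA" (label 0).
new_labels = torch.tensor([0])
non_zero_index = torch.nonzero(new_labels).squeeze(1)   # empty index tensor

# Index shapes (0,) and (1,) broadcast to (0,), so the result is empty.
picked = instance_logit[non_zero_index, new_labels]
print(non_zero_index.shape, picked.shape)
```

With a single all-NA label the lookup returns an empty tensor, which agrees with the conclusion that an empty non_zero_index is not the source of the crash.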

Add the following code to check whether any index is out of bounds:

if torch.nonzero(non_zero_index > (instance_logit.size(0) - 1)).size(0) != 0:
  print("OUT OF RANGE!")
  print(non_zero_index)
  print(instance_logit.size(0))
if torch.nonzero(new_labels > (instance_logit.size(1) - 1)).size(0) != 0:
  print("OUT OF RANGE!")
  print(new_labels)
  print(instance_logit.size(1))
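
A more compact variant of the same check is sketched below, together with a CPU reproduction. On CPU, an out-of-range index raises a readable IndexError at the offending line instead of an asynchronous device-side assert. The helper `index_in_range` and the label value 58 are hypothetical. The logit shape is taken from the debug output above.

```python
import torch

def index_in_range(index, size):
    """True if every entry of an integer index tensor lies in [0, size).

    Negative entries are treated as invalid here because label ids and
    row positions are never negative.
    """
    return bool(((index >= 0) & (index < size)).all())

instance_logit = torch.zeros(7, 58)   # same shape as in the debug output
non_zero_index = torch.tensor([0])
new_labels = torch.tensor([58])       # hypothetical out-of-range label

rows_ok = index_in_range(non_zero_index, instance_logit.size(0))
cols_ok = index_in_range(new_labels, instance_logit.size(1))
print(rows_ok, cols_ok)

# On CPU the same lookup fails immediately with a readable message.
try:
    instance_logit[non_zero_index, new_labels]
    cpu_error = None
except IndexError as err:
    cpu_error = str(err)
print(cpu_error)
```

Running the failing batch on CPU, or with CUDA_LAUNCH_BLOCKING=1 as the error message suggests, makes the stack trace point at the real indexing statement.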

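Tracing how the value of non_zero_index is produced:

# Find the indices of the non-zero values in new_labels
non_zero_index = torch.nonzero(new_labels).squeeze(1)   # samples clustered as "NA" must be excluded

Here new_labels holds the cluster labels of the noisy samples, and elements equal to 0 have not yet been excluded at this point.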

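Root cause: the values in new_labels exceeded the valid range given by instance_logit.size(1). The representation fed to the clustering model should have shape $(n, H)$, but when the error occurred its shape was $(H)$. When there is only one noisy sample, the statement that extracts it returns a 1-D vector instead of a 2-D matrix.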
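
The offending statement is not shown here, so the following only sketches one common way this kind of shape collapse happens: calling squeeze() without a dimension on the result of torch.nonzero. The sizes and the mask are hypothetical.

```python
import torch

rep = torch.randn(7, 4)   # (n, H) with hypothetical sizes
# Exactly one sentence in the bag is flagged as noisy.
is_noise = torch.tensor([False, False, True, False, False, False, False])

# squeeze() with no argument also drops the length-1 sample dimension,
# so the index becomes 0-dimensional and the lookup returns a vector.
bad_index = torch.nonzero(is_noise).squeeze()
bad_noise_mat = rep[bad_index]          # shape (H,)

# squeeze(1) drops only the column added by nonzero(), keeping shape (1,),
# so the lookup still returns a matrix with one row.
good_index = torch.nonzero(is_noise).squeeze(1)
good_noise_mat = rep[good_index]        # shape (1, H)

print(bad_noise_mat.shape, good_noise_mat.shape)
```

Keeping the sample dimension explicit, for example with squeeze(1) or a slice, means the clustering model always receives a 2-D input even when only one sample is selected.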

References

[1] Shang Y, Huang H Y, Mao X L, et al. Are noisy sentences useless for distant supervised relation extraction?[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 34(05): 8799-8806.
