Category: data augmentation
This is somewhat similar to the paper "Open Relation and Event Type Discovery with Type Abstraction".
background
In few-shot scenarios, data augmentation increases the size and diversity of the training data.
Recent studies explore (1) rule-based methods and (2) the potential of leveraging data from high-resource tasks.
Core idea: change the style-related attributes of text while preserving its semantics.
The paper formulates the task as a paraphrase generation problem.
1. Model
Depending on whether parallel data is available, the paper proposes two ways to solve the problem.
For parallel data, a paraphrase generation model bridges the gap between the source and target styles.
For non-parallel data, cycle-consistent reconstruction paraphrases sentences into the target style and then paraphrases them back into their original style (a bit convoluted).
model structure
paraphrase generation
Two loss functions:
L_pg: a loss over the NER labels, checking whether the model's predicted BIO tags are correct.
L_adv: an adversarial loss measuring the similarity between the input and its paraphrase, judged by a discriminator.
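As a rough sketch of how these two terms might combine (the tag probabilities, discriminator score, and equal weighting below are hypothetical toy numbers, not the paper's actual formulation):

```python
import math

def tag_loss(pred_probs, gold_tags):
    """L_pg sketch: mean cross-entropy over predicted BIO tag distributions.

    pred_probs: one dict per token mapping BIO tag -> predicted probability.
    gold_tags:  the gold BIO tag sequence.
    """
    return -sum(math.log(p[t]) for p, t in zip(pred_probs, gold_tags)) / len(gold_tags)

def adversarial_loss(disc_score):
    """L_adv sketch: the generator is penalized when the discriminator
    assigns low probability to the paraphrase matching the input."""
    return -math.log(disc_score)

# Hypothetical predictions for a 3-token sentence "John Smith left".
probs = [{"B-PER": 0.7, "I-PER": 0.2, "O": 0.1},
         {"B-PER": 0.1, "I-PER": 0.8, "O": 0.1},
         {"B-PER": 0.05, "I-PER": 0.05, "O": 0.9}]
gold = ["B-PER", "I-PER", "O"]
total = tag_loss(probs, gold) + adversarial_loss(0.9)  # combined training signal
```

In the paper the two terms are weighted and backpropagated through the generator; here they are simply summed to show the shape of the objective.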
cycle-consistent reconstruction
Process
First: the generator G_θ generates the paraphrase ỹ_cycle of the input sentence x_cycle concatenated with a prefix.
Second: we concatenate the paraphrase ỹ_cycle with a different prefix as the input to the generator G_θ and let it transfer the paraphrase back to the original sentence ŷ_cycle.
The loss contains two parts.
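The two-step cycle can be made concrete with a toy stand-in for the shared generator G_θ (in the paper this is a seq2seq model; the replacement table and direction prefixes below are purely illustrative):

```python
# Toy rule-based "generator" standing in for the shared seq2seq model G_theta.
# The prefix selects the transfer direction, as in prefix-conditioned generation.
FORMAL_TO_INFORMAL = {"cannot": "can't", "going to": "gonna"}
INFORMAL_TO_FORMAL = {v: k for k, v in FORMAL_TO_INFORMAL.items()}

def generator(prefix, sentence):
    """Stand-in for G_theta: rewrites the sentence according to the prefix."""
    table = FORMAL_TO_INFORMAL if prefix == "to-informal:" else INFORMAL_TO_FORMAL
    for src, tgt in table.items():
        sentence = sentence.replace(src, tgt)
    return sentence

x = "we cannot attend because we are going to travel"
y_tilde = generator("to-informal:", x)   # step 1: paraphrase with one prefix
y_hat = generator("to-formal:", y_tilde) # step 2: transfer back with another prefix
reconstructed = (y_hat == x)             # cycle-consistency check
```

With a learned generator, the reconstruction loss compares ŷ_cycle against the original sentence token by token rather than with an exact-match check.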
2. Data Selection
Even with an effective architecture, the generated sentences may still be unreliable: they can be low quality due to degenerate repetition and incoherent gibberish (Holtzman et al., 2020; Welleck et al., 2020). To mitigate this, the paper further performs data selection with the following metrics.
- Consistency: the confidence score from a pre-trained style classifier, measuring how well the generated sentence fits the target style.
- Adequacy: the confidence score from a pre-trained NLU model for how much of the semantics the generated sentence preserves.
- Appropriateness/fluency: the confidence score from a pre-trained NLU model indicating how fluent the generated sentence is.
- Diversity: the character-level edit distance between the original and the generated sentence.
For each sentence, we over-generate k = 10 candidates, compute the metrics above (see Appendix C), and assign each candidate a weighted score of these metrics. We then rank all candidates by this score and select the best one to train the NER system.
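The ranking step can be sketched as follows. Only the diversity metric (character-level edit distance) is computed for real here; the consistency, adequacy, and fluency values, the candidate sentences, and the equal weights are all hypothetical stand-ins for scores from the pre-trained models:

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance (the diversity metric)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def select_best(original, candidates, weights):
    """Rank over-generated candidates by a weighted sum of the four metrics
    and return the best one. Each candidate carries hypothetical model-based
    scores; diversity is computed here as a length-normalized edit distance."""
    def score(cand):
        text, m = cand
        diversity = edit_distance(original, text) / max(len(original), len(text))
        return (weights["consistency"] * m["consistency"]
                + weights["adequacy"] * m["adequacy"]
                + weights["fluency"] * m["fluency"]
                + weights["diversity"] * diversity)
    return max(candidates, key=score)[0]

original = "we cannot go"
cands = [("we cannot go", {"consistency": 0.2, "adequacy": 0.99, "fluency": 0.9}),
         ("we can't go",  {"consistency": 0.9, "adequacy": 0.95, "fluency": 0.9}),
         ("go go go go",  {"consistency": 0.8, "adequacy": 0.1,  "fluency": 0.3})]
w = {"consistency": 1.0, "adequacy": 1.0, "fluency": 1.0, "diversity": 1.0}
best = select_best(original, cands, w)  # the informal, meaning-preserving candidate wins
```

An unchanged copy scores low on consistency and diversity, and gibberish scores low on adequacy and fluency, so the weighted sum favors candidates that actually change style while keeping the meaning.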
Experiments
- The effect of different data augmentation strategies.
The source data involves five different domains in the formal style: broadcast conversation (BC), broadcast news (BN), magazine (MZ), newswire (NW), and web data (WB), while the target data involves only the social media (SM) domain in the informal style.
- The impact of different factors on performance (ablation study).
Summary
Doesn't this feel overly complicated? Compared with question-based data augmentation, this method additionally has to distinguish whether the data is parallel or not and build a separate model for each case.