Paper Reading Notes 1: Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks

IT蛮牛

Article link: https://blog.csdn.net/ifree001/article/details/126514502

These notes come from a first read of the paper. Please forgive any inaccuracies; corrections are welcome.

Reference: [Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks](https://aclanthology.org/2022.acl-short.97) (Wu et al., ACL 2022)

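I. Vocabulary:

low-resource regime (低资源状态): a setting with very little labeled training data

data augmentation (数据增强): creating additional training examples from existing ones

alleviate overfitting (缓解过拟合): reduce overfitting

semantically (语义上): in terms of meaning

context augmentation (上下文增强): augmentation that replaces words based on their context

LSTM (长短期记忆): long short-term memory

sampling (采样): drawing items at random from a distribution

MLM (掩码语言模型): masked language model

bi-directional (双向): using both left and right context

contextual-compatible (上下文兼容): fitting the surrounding context

word embedding matrix (词嵌入矩阵): the matrix holding one embedding vector per vocabulary entry

sentiment classification (情感分类): classifying text by sentiment

supervision (监督): the training signal provided by labels

interpolate (插值，动词): to mix two values with a weight

interpolation (插值): the result or act of interpolating

derived from (源于): obtained from

interpolation operation (插值运算): the operation that performs the mixing

prepending (前置): adding something to the front

prompt (提示): the text given to a language model as the start of generation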

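Selected terms:

one-hot representation: a vector of the vocabulary size in which exactly one position is 1 and all other positions are 0

EDA: consists of four operations: synonym replacement, random insertion, random swap, and random deletion

Back Translation: translate the text into another language and then translate it back

CBERT: uses a pretrained BERT to obtain semantically compatible replacement words

BERTexpand, BERTprepend: condition BERT on the class by prepending the class label to every example of that class

GPT2context: gives a pretrained GPT model a prompt and lets it keep generating text

BARTword, BARTspan: condition BART on the class by prepending the class label to every example of that class (BARTword masks a single word, while BARTspan masks a contiguous span)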
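The four EDA operations listed above can be illustrated with a small sketch. This is not the reference EDA implementation: the `SYNONYMS` table is a made-up stand-in for a real thesaurus, and the function names and the deletion probability `p` are assumptions chosen for illustration.

```python
import random

# Toy synonym table; a hypothetical stand-in for a real thesaurus.
SYNONYMS = {"good": ["great", "fine"], "movie": ["film"]}

def synonym_replacement(words, rng):
    """Replace one randomly chosen word that has an entry in SYNONYMS."""
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    out = list(words)
    if candidates:
        i = rng.choice(candidates)
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, rng):
    """Insert a synonym of a randomly chosen word at a random position."""
    candidates = [w for w in words if w in SYNONYMS]
    out = list(words)
    if candidates:
        new_word = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randrange(len(out) + 1), new_word)
    return out

def random_swap(words, rng):
    """Swap the words at two randomly chosen positions."""
    out = list(words)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, rng, p=0.2):
    """Drop each word with probability p, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)
sentence = "this is a good movie".split()
for op in (synonym_replacement, random_insertion, random_swap, random_deletion):
    print(op.__name__, "->", " ".join(op(sentence, rng)))
```

Each function returns a new list and leaves the input untouched, so the same sentence can be augmented several times.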

This paper presents a data augmentation method called text smoothing.

It works by converting a sentence from its one-hot representation to a controllable smoothed representation, and it targets the low-resource regime.
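To make the starting point concrete, here is a minimal sketch of a sentence in one-hot form. The toy vocabulary, the embedding size, and the random embedding values are assumptions for illustration only. The sketch also checks a standard fact: multiplying one-hot rows by an embedding matrix gives the same result as looking the rows up by index, which is why a non-one-hot (smoothed) row can be fed through the same matrix.

```python
import numpy as np

# Toy vocabulary (an assumption; a real model has tens of thousands of entries).
vocab = ["[PAD]", "this", "is", "a", "good", "movie"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

sentence = ["this", "is", "a", "good", "movie"]
ids = np.array([token_to_id[t] for t in sentence])

# One row per token; each row has a single 1 at that token's vocabulary index.
one_hot = np.eye(len(vocab))[ids]              # shape (5, 6)

# Toy embedding matrix: one 4-dimensional vector per vocabulary entry.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), 4))

# One-hot rows times the embedding matrix equals a plain index lookup.
assert np.allclose(one_hot @ embedding, embedding[ids])
print(one_hot.shape)
```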

The experiment code is available at: https://github.com/caskcsg/TextSmoothing

The schematic of the method is as follows:

Core steps:

"We combine the two stages as text smoothing: obtaining a smooth representation through MLM and interpolating to constrain the representation more controllable." (That is: first obtain a smooth representation, then interpolate it so that the representation is more controllable.)
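The two stages can be sketched numerically. This is a rough illustration, not the authors' code: random logits stand in for the output of a real masked language model, and the function name `text_smoothing`, the weight name `lam`, and the mixing form `lam * one_hot + (1 - lam) * mlm_probs` reflect my reading of the paper and should be treated as assumptions.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def text_smoothing(one_hot, mlm_logits, lam):
    """Mix one-hot rows with MLM probabilities using weight lam in [0, 1]."""
    mlm_probs = softmax(mlm_logits)                  # stage 1: smooth representation
    return lam * one_hot + (1.0 - lam) * mlm_probs   # stage 2: interpolation

vocab_size, seq_len, hidden = 6, 5, 4
rng = np.random.default_rng(0)
ids = np.array([1, 2, 3, 4, 5])
one_hot = np.eye(vocab_size)[ids]
# Random logits are a placeholder for a real masked language model's output.
mlm_logits = rng.normal(size=(seq_len, vocab_size))
embedding = rng.normal(size=(vocab_size, hidden))

mixed = text_smoothing(one_hot, mlm_logits, lam=0.5)
# Each row is still a probability distribution over the vocabulary.
assert np.allclose(mixed.sum(axis=1), 1.0)
# The mixed rows become input vectors through the word embedding matrix.
smoothed_inputs = mixed @ embedding                  # shape (5, 4)
print(smoothed_inputs.shape)
```

With `lam=1.0` the function returns the original one-hot rows, and with `lam=0.0` it returns the pure MLM distribution, so `lam` controls how far the representation moves away from the original tokens.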

Obtaining a smooth representation: