恒源云 (GPUSHARE)_[Text Classification] Text Data Augmentation 1 (Paper Notes)


AI酱油君

Last modified: 2022-02-23 15:40:48


First published: 2021-12-21 14:50:55

Article link: https://blog.csdn.net/weixin_53977063/article/details/122061184


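This article shows how to augment data for text classification with EDA (easy data augmentation) and back-translation. The EDA operations covered are random replacement, random insertion, random deletion, and random swapping of adjacent words. Back-translation goes from Chinese to English and back to Chinese through the pretrained mbart50 model, with the aim of improving model performance.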


Source | 恒源云 community (恒源云, a shared computing platform focused on the AI industry)

Original post | Text Data Augmentation

Original author | 角灰

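I have recently been working on news headline classification and found an article on data augmentation to study:
一篇就够！数据增强方法综述 ("One Article Is Enough! A Survey of Data Augmentation Methods")
This post implements EDA (easy data augmentation) and back-translation: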

I. EDA

1.1 Random replacement

import random
import jieba
import numpy as np
import paddle
from paddlenlp.embeddings import TokenEmbedding
# Find the top-k synonyms of a word in the embedding table, ranked by cosine similarity
def get_similar_tokens_raw(query_token, k, token_embedding):
    W = np.asarray(token_embedding.weight.numpy())
    x = np.asarray(token_embedding.search(query_token).reshape(-1))
    cos = np.dot(W, x) / np.sqrt(np.sum(W * W, axis=1) * np.sum(x * x) + 1e-9)
    flat = cos.flatten()
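    # argpartition puts the k-th largest entry at its sorted position, with smaller
    # values to its left and larger values to its right, in only O(n)
    # With -k, position -k and everything to its right form the top-k; only those need sorting
    indices = np.argpartition(flat, -k)[-k:]
    indices = indices[np.argsort(-flat[indices])] # negate to sort in descending order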
    return token_embedding.vocab.to_tokens(indices)
# Random replacement
def random_replace(words,token_embedding,prob=0.1,max_change=3):
    change_num=0
    for idx in range(len(words)):
        prob_i=prob*(len(words[idx])-0.5) # the -0.5 halves the probability for length-1 words, so they are rarely picked
        if random.uniform(0,1)<prob_i: # the longer the word, the more likely it is replaced
            sim_words=get_similar_tokens_raw(words[idx],k=5,token_embedding=token_embedding)
            words[idx]=random.choice(sim_words)
            change_num+=1
        if change_num>=max_change:
            break
    return words
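
The length-weighted sampling in random_replace can be tried without loading any embeddings. The following is a minimal sketch, not the author's code: `TOY_SYNONYMS`, `toy_similar_tokens`, `replace_prob`, and `toy_random_replace` are hypothetical names, and the fixed synonym table stands in for the real `get_similar_tokens_raw` lookup. Unlike the original, this version works on a copy of the input list.

```python
import random

# Hypothetical stand-in for get_similar_tokens_raw: a fixed synonym table
# instead of a real embedding lookup (assumption, for illustration only).
TOY_SYNONYMS = {
    "经济": ["金融", "财经"],
    "增长": ["上涨", "提升"],
}

def toy_similar_tokens(word, k):
    # Fall back to the word itself when the toy table has no entry.
    return TOY_SYNONYMS.get(word, [word])[:k]

def replace_prob(word, prob=0.1):
    # Same weighting as random_replace: prob * (len(word) - 0.5).
    # A length-1 word gets prob * 0.5, a length-2 word gets prob * 1.5.
    return prob * (len(word) - 0.5)

def toy_random_replace(words, prob=0.1, max_change=3, rng=random):
    words = list(words)  # work on a copy so the caller's list is left untouched
    change_num = 0
    for idx, word in enumerate(words):
        if rng.uniform(0, 1) < replace_prob(word, prob):
            words[idx] = rng.choice(toy_similar_tokens(word, k=5))
            change_num += 1
        if change_num >= max_change:
            break  # stop once max_change words have been replaced
    return words
```

With the default prob=0.1, a single-character word is replaced with probability 0.05 and a two-character word with probability 0.15, so short function words are mostly left alone.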

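The argpartition-based top-k lookup can also be checked on a toy matrix with NumPy alone. This is a sketch under stated assumptions rather than the PaddleNLP code path: `topk_cosine`, the 2-D "embedding" matrix `W`, and the query vectors are all made up for illustration, and the function accepts one query or a batch of queries.

```python
import numpy as np

def topk_cosine(W, X, k):
    """For each query row in X, return the indices of the k rows of W
    with the highest cosine similarity, best match first."""
    W = np.asarray(W, dtype=np.float64)
    X = np.atleast_2d(np.asarray(X, dtype=np.float64))
    # Normalise every vector separately; the epsilon avoids division by zero.
    Wn = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-9)
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-9)
    cos = Xn @ Wn.T  # shape: (n_queries, vocab_size)
    # argpartition moves the k largest entries of each row into the
    # last k slots (in no particular order).
    part = np.argpartition(cos, -k, axis=1)[:, -k:]
    # Sort only those k candidates by descending similarity.
    order = np.argsort(-np.take_along_axis(cos, part, axis=1), axis=1)
    return np.take_along_axis(part, order, axis=1)

# Toy 2-D "embedding" matrix with four rows (made-up values).
W = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
queries = np.array([[1.0, 0.0], [0.0, 2.0]])
top2 = topk_cosine(W, queries, k=2)
print(top2.tolist())  # [[0, 1], [2, 1]]
```

When the query vector is itself a row of the matrix, that row comes back as the best match, since its cosine similarity with itself is 1. In a real vocabulary lookup it may therefore be worth dropping the query word from the candidates before sampling a replacement.
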
Because get_similar_tokens_raw can only look up the synonyms of one word per call, it is slow, so I changed it to query several words at once, as follows:

# Look up the top-k synonyms of several words at once
def get_similar_tokens_multi(query_tokens, k, token_embedding):
    n_tokens=len(query_tokens)
    W = paddle.to_tensor(token_embedding.weight.detach(),dtype='float16')
    q_idx=token_embedding.search(query_tokens)
    x = paddle.to_tensor(q_idx,dtype='float16').transpose((1,0))
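    # Normalise per query column (axis=0) so that each column of cos holds true cosine similarities
    cos = paddle.matmul(W, x) / paddle.sqrt(paddle.sum(W * W, axis=1,keepdim=True) * paddle.sum(x * x,axis=0,keepdim=True) + 1e-9)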

    def sort_row_by_idx(input, indices):
        assert input.shape == indices.shape
        row, col = input.shape
        indices = indices * col + np.arange(