Text Data Augmentation with MarianMT

Bilibili tutorial video: Text Data Augmentation with MarianMT

pip install transformers==4.1.1 sentencepiece==0.1.94
pip install mosestokenizer==1.1.0

from transformers import MarianMTModel, MarianTokenizer

Initialize the model that translates English into Romance languages:

target_model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
target_tokenizer = MarianTokenizer.from_pretrained(target_model_name)
target_model = MarianMTModel.from_pretrained(target_model_name)

Initialize the model that translates Romance languages (French, for example) back into English:

en_model_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
en_tokenizer = MarianTokenizer.from_pretrained(en_model_name)
en_model = MarianMTModel.from_pretrained(en_model_name)

Write a helper function that translates a batch of texts with a given machine translation model and tokenizer:

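def translate(texts, model, tokenizer, language='fr'):
    # Multilingual models need a >>lang<< prefix to select the target language
    template = lambda text: f"{text}" if language == "en" else f">>{language}<<{text}"
    src_texts = [template(text) for text in texts]

    # Tokenize the batch and return PyTorch tensors for generate()
    encoded = tokenizer.prepare_seq2seq_batch(src_texts, return_tensors="pt")

    # Generate the translations
    translated = model.generate(**encoded)

    # Decode the generated token ids back into text
    translated_texts = tokenizer.batch_decode(translated, skip_special_tokens=True)

    return translated_texts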
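
The multilingual en-ROMANCE model picks its output language from a `>>code<<` token at the start of each input, which is what the `template` lambda in `translate` prepends. Below is a minimal sketch of just that step, runnable without downloading any model. The helper name `add_language_prefix` is hypothetical and not part of transformers.

```python
def add_language_prefix(texts, language="fr"):
    """Prepend the >>lang<< target token, mirroring the template in translate()."""
    if language == "en":
        # The ROMANCE-en model only produces English, so no prefix is added.
        return list(texts)
    return [f">>{language}<<{text}" for text in texts]


print(add_language_prefix(["This is so cool"], language="es"))
# ['>>es<<This is so cool']
```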

Back-translation function:

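def back_translate(texts, source_lang="en", target_lang="fr"):
    # Translate from the source language into the target language
    fr_texts = translate(texts, target_model, target_tokenizer, language=target_lang)

    # Translate the result back into the source language
    back_translated_texts = translate(fr_texts, en_model, en_tokenizer, language=source_lang)

    return back_translated_texts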

Run the augmentation (English to Spanish and back):

en_texts = ['This is so cool', 'I hated the food', 'They were very helpful']
aug_texts = back_translate(en_texts, source_lang="en", target_lang="es")
print(aug_texts)
# Output:
# ["Yeah,it's so cool.","It's the food I I hate.","They were of great help."]

Augment using English to French and back:

en_texts = ['This is so cool', 'I hated the food', 'They were very helpful']
aug_texts = back_translate(en_texts, source_lang="en", target_lang="fr")
print(aug_texts)
# Output:
# ["It's so cool.","I hate food.","They've been very helpful."]
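
Because different pivot languages yield different paraphrases, the two runs above can be combined into one augmentation pass. The following is only a sketch under stated assumptions: `augment_with_pivots` is a hypothetical helper, and it takes the back-translation function as a parameter so that the `back_translate` defined earlier (or any function with the same signature) can be passed in.

```python
def augment_with_pivots(texts, back_translate_fn, pivot_langs=("fr", "es")):
    """Collect distinct paraphrases of each text across several pivot languages."""
    augmented = [[] for _ in texts]
    for lang in pivot_langs:
        paraphrases = back_translate_fn(texts, source_lang="en", target_lang=lang)
        for i, paraphrase in enumerate(paraphrases):
            # Skip round trips that return the input unchanged or repeat a paraphrase.
            if paraphrase != texts[i] and paraphrase not in augmented[i]:
                augmented[i].append(paraphrase)
    return augmented


# Usage with the functions defined earlier (requires the models to be loaded):
# augmented = augment_with_pivots(en_texts, back_translate, pivot_langs=("fr", "es"))
```

The result is a list parallel to `texts`, where each entry holds the new variants for the corresponding input sentence.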

The following attribute lists all target language codes available for augmentation:

target_tokenizer.supported_language_codes
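
The codes are assumed here to come back in the same `>>fr<<` token form that `translate` prepends, whereas `back_translate` expects the bare code (`fr`) as `target_lang`. A small sketch for converting between the two follows. `strip_language_markers` is a hypothetical helper, and the sample list is illustrative only, not the tokenizer's real output.

```python
def strip_language_markers(codes):
    """Turn tokens such as '>>fr<<' into plain codes such as 'fr'."""
    return sorted(
        code[2:-2]
        for code in codes
        if code.startswith(">>") and code.endswith("<<")
    )


# Illustrative sample; in practice pass target_tokenizer.supported_language_codes
sample_codes = [">>fr<<", ">>es<<", ">>it<<"]
print(strip_language_markers(sample_codes))
# ['es', 'fr', 'it']
```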