Preface
Large models sit at the cutting edge of today's technology; in the AI field, any discussion that leaves them out hardly seems "state of the art". Training a large model requires data (raw material), hardware (high-performance GPUs/TPUs/NPUs), craftsmanship (algorithms), people (algorithm engineers), and so on, i.e. a substantial investment of resources. For ordinary users the return rarely justifies that cost, yet much of our work can benefit from AI for efficiency or convenience. Fortunately, this does not mean we have to build everything from scratch: Hugging Face already hosts many pre-trained models, which we can use directly to build our own AI applications, or fine-tune to fit our needs. This post shares a simple example of using such a model for English translation.
1. Splitting an English Paragraph into Sentences
Define a class that splits an English paragraph into sentences.
The code is as follows:
import copy

class Ensplit(object):
    """Split an English paragraph into sentences by terminal punctuation."""

    def __init__(self, delimiter = None):
        self.delimiter_list = self.__add_delimiter__(
            delimiter = delimiter
        )

    def __add_delimiter__(self, delimiter):
        # Sentence-ending punctuation; callers may supply extra delimiters.
        default_delimiter = [".", "?", "!"]
        if delimiter is not None:
            for i in delimiter:
                if i not in default_delimiter:
                    default_delimiter.append(i)
        return default_delimiter

    def split(self, original_string):
        new_string = copy.copy(original_string)
        # Insert a marker after each delimiter, then split on the marker,
        # so every sentence keeps its terminal punctuation.
        for sep in self.delimiter_list:
            new_string = new_string.replace(sep, "{}\\+/".format(sep))
        split_list = new_string.split("\\+/")
        return [i.strip() for i in split_list if i.strip() != ""]
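As a side note, the same split-and-keep-punctuation behavior can be obtained with the standard library's re module in a single call. The helper below is only an illustrative alternative, not used in the rest of this post:

```python
import re

def split_sentences(text, delimiters=".?!"):
    """Split text on whitespace that follows a sentence-ending
    character, keeping the punctuation attached to its sentence."""
    pattern = r"(?<=[{}])\s+".format(re.escape(delimiters))
    return [s.strip() for s in re.split(pattern, text) if s.strip()]
```

For example, `split_sentences("Hello world. How are you? Fine!")` yields the three sentences with their punctuation intact.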
Because of the model's input-length limit, we cannot feed an entire English document to the model at once, so the text must be split. This example uses the simplest approach: splitting a paragraph into sentences at punctuation marks. Doing so discards cross-sentence context, which hurts translation accuracy over the document as a whole. One alternative is to translate each sentence together with the tail of the preceding sentence and the head of the following one, preserving some contextual meaning; however, this increases the model's computation, degrades efficiency and response time, and makes reassembling the translated output harder. There are many ways to split a paragraph; for ease of exposition, this post sticks with the simple one.
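The overlapping-context idea mentioned above can be sketched as follows. This is a minimal illustration, not part of the example's code, and the window sizes are arbitrary assumptions:

```python
def build_context_windows(sentences, prev_words=5, next_words=5):
    """For each sentence, prepend the last `prev_words` words of the
    previous sentence and append the first `next_words` words of the
    next one, so some cross-sentence context reaches the model."""
    windows = []
    for idx, sent in enumerate(sentences):
        prev_tail = " ".join(sentences[idx - 1].split()[-prev_words:]) if idx > 0 else ""
        next_head = " ".join(sentences[idx + 1].split()[:next_words]) if idx + 1 < len(sentences) else ""
        # Join only the non-empty parts with single spaces.
        windows.append(" ".join(part for part in (prev_tail, sent, next_head) if part))
    return windows
```

Note that after translating such windows, the extra context must be trimmed from each output before the results are stitched back together, which is the reassembly difficulty mentioned above.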
2. Translating with the Model
The code is as follows:
text = "Massive key performance indicators (KPIs) are monitored as multivariate time series data (MTS) to ensure the reliability of the software applications and service system. Accurately detecting the abnormality of MTS is very critical for subsequent fault elimination. The scarcity of anomalies and manual labeling has led to the development of various self-supervised MTS anomaly detection (AD) methods, which optimize an overall objective/loss encompassing all metrics' regression objectives/losses. However, our empirical study uncovers the prevalence of conflicts among metrics' regression objectives, causing MTS models to grapple with different losses. This critical aspect significantly impacts detection performance but has been overlooked in existing approaches. To address this problem, by mimicking the design of multi-gate mixture-of-experts (MMoE), we introduce CAD, a Conflict-aware multivariate KPI Anomaly Detection algorithm. CAD offers an exclusive structure for each metric to mitigate potential conflicts while fostering inter-metric promotions. Upon thorough investigation, we find that the poor performance of vanilla MMoE mainly comes from the input-output misalignment settings of MTS formulation and convergence issues arising from expansive tasks. To address these challenges, we propose a straightforward yet effective task-oriented metric selection and p&s (personalized and shared) gating mechanism, which establishes CAD as the first practicable multi-task learning (MTL) based MTS AD model. Evaluations on multiple public datasets reveal that CAD obtains an average F1-score of 0.943 across three public datasets, notably outperforming state-of-the-art methods."
import copy
from transformers import MarianMTModel, MarianTokenizer

# Local path of the translation model
model_name = "./opus-mt-en-zh"
# Load the model and its tokenizer
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
# Instantiate the sentence splitter
split_class = Ensplit()
split_list = split_class.split(
    original_string = text
)
for i in split_list:
    # Encode the text into a format the model can process
    translated = tokenizer(i, return_tensors = "pt", padding = True, truncation = True)
    # Run the translation
    translated_tokens = model.generate(**translated)
    # Decode the generated sequence into readable text
    translated_text = tokenizer.decode(
        translated_tokens[0],
        skip_special_tokens = True
    )
    print(f"Source: {i}")
    print("-----------------")
    print(f"Translation: {translated_text}")
The translation model chosen here is English-to-Chinese; it can be downloaded from https://hf-mirror.com/Helsinki-NLP/opus-mt-en-zh.
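The loop above issues one generate call per sentence. Since the tokenizer already supports padding, the sentences can also be translated in a single batched call. This is a minimal sketch, assuming the tokenizer and model objects loaded above; it relies only on the standard Hugging Face call/generate/batch_decode interface:

```python
def translate_batch(sentences, tokenizer, model):
    # Tokenize all sentences at once; padding aligns their lengths
    # so they can share one batch of tensors.
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    translated_tokens = model.generate(**batch)
    # batch_decode returns one decoded string per input sentence, in order.
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
```

With the objects from the code above, `translate_batch(split_list, tokenizer, model)` returns the translated sentences in order; batching trades higher peak memory for fewer generate calls.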
Summary
This post is a simple worked example of translating a long English text with an open-source model, kept as a note for future study and reference.