Building an Attention-Based Machine Translation System: A Hands-On Walkthrough

雨下成一朵花

Last modified: 2023-06-07 12:55:58


Tags: artificial intelligence, natural language processing, machine translation, tensorflow, chatgpt

First published: 2023-06-06 09:53:15

Article link: https://blog.csdn.net/zql1009/article/details/131061198


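Project dependencies:

1. tensorflow==2.12.0, matplotlib==3.7.1

2. scikit-learn (imported as sklearn): pip install scikit-learn (no 1.10.1 release of scikit-learn exists on PyPI, so install a version that supports Python 3.9)

3. jieba==0.42.1

4. paddlepaddle==2.4.2:

python -m pip install paddlepaddle==2.4.2 -f https://paddlepaddle.org.cn/whl/stable.html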

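Steps:

1. Create a virtual environment (tensorflownlp) and activate it

When running the code, install any missing libraries from cmd and debug and adjust the code as needed.

conda create -n tensorflownlp python=3.9.16

conda activate tensorflownlp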

2. Install tensorflow, matplotlib, scikit-learn, pandas, paddle, and jieba

pip install tensorflow==2.12.0 -i https://pypi.doubanio.com/simple

pip install matplotlib==3.7.1

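pip install scikit-learn

pip install pandas

Note: no scikit-learn 1.10.1 or pandas 1.16.0 release exists on PyPI, so these two packages are installed unpinned and pip picks versions compatible with the environment.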

python -m pip install paddlepaddle==2.4.2 -f https://paddlepaddle.org.cn/whl/stable.html

pip install jieba==0.42.1
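
After the installs finish, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` and the `PACKAGES` list are assumptions made for illustration and are not part of the original project.

```python
from importlib import metadata

# Distribution names as published on PyPI (assumed to match the install steps above).
PACKAGES = ["tensorflow", "matplotlib", "scikit-learn", "pandas", "paddlepaddle", "jieba"]


def installed_versions(packages):
    """Return {distribution name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

Run it inside the activated tensorflownlp environment; any package reported as NOT INSTALLED can then be installed with the matching pip command.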

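3. Open PyCharm with administrator privileges and import the project

The project consists of the cmn dataset, the data-processing module, the model configuration, the attention model, and the test code.

During training, the model is saved to the training_checkpoint folder, which holds the ckpt files along with the checkpoint file that records the most recent checkpoint state.

Training for 50 epochs took more than ten hours.

4. Model architecture and walkthrough

Data preprocessing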
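
The preprocessing code in this section calls `preprocess_sentence`, whose definition is not reproduced in this post. Below is a rough sketch of what such a function might look like, under stated assumptions: English is lowercased and basic punctuation is spaced out, Chinese is split character by character (the project installs jieba, so the real code most likely does word segmentation instead), and every sentence is wrapped in `<start>` and `<end>` markers. None of these details are confirmed by the original.

```python
import re

# Matches a single CJK unified ideograph (basic block only).
_CJK = re.compile(r"[\u4e00-\u9fff]")


def preprocess_sentence(sentence):
    """Hypothetical stand-in for the project's preprocess_sentence."""
    s = sentence.strip().lower()
    if _CJK.search(s):
        # Chinese: keep only CJK characters and emit one token per character.
        # (Assumption: the real project probably segments words with jieba.)
        tokens = _CJK.findall(s)
    else:
        # English: put spaces around basic punctuation, replace everything
        # except letters and that punctuation with a space, then split.
        s = re.sub(r"([?.!,])", r" \1 ", s)
        s = re.sub(r"[^a-z?.!,]+", " ", s)
        tokens = s.split()
    # Start/end markers let a decoder know where a sentence begins and stops.
    return "<start> " + " ".join(tokens) + " <end>"


if __name__ == "__main__":
    print(preprocess_sentence('he is a "Editor-in-Chief".'))
    print(preprocess_sentence('人工智能程序员这种职业太*&￥%的厉害了！?Are you ok'))
```

With this sketch, any sentence containing Chinese characters loses its Latin letters and symbols entirely, which is a simplification chosen to keep the example short.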

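import io

import tensorflow as tf
from sklearn.model_selection import train_test_split

# Note: preprocess_sentence() is used here, but its definition is not shown in this post.

# Read the file, preprocess every sentence, and return the sentences grouped
# by language: (english sentences, chinese sentences). Each line of cmn.txt
# is tab-separated, so only the first two fields are kept.
def create_dataset(path, num_examples):
    with io.open(path, encoding='UTF-8') as f:
        lines = f.read().strip().split('\n')
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')[0:2]] for l in lines[:num_examples]]
    return zip(*word_pairs)

# Length of the longest sequence in a tensor
def max_length(tensor):
    return max(len(t) for t in tensor)

# Tokenize: map words to integer ids and pad every sequence at the end
def tokenize(lang):
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
    lang_tokenizer.fit_on_texts(lang)
    tensor = lang_tokenizer.texts_to_sequences(lang)
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')
    return tensor, lang_tokenizer

# Build the cleaned input/output pairs (input = Chinese, target = English)
def load_dataset(path, num_examples=None):
    targ_lang, inp_lang = create_dataset(path, num_examples)
    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)
    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

# Print the id-to-word mapping for one encoded sentence (id 0 is padding)
def convert(lang, tensor):
    for t in tensor:
        if t != 0:
            print("%d ----> %s" % (t, lang.index_word[t]))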
if __name__ == "__main__":
    num_examples = 100
    # Chinese-English parallel corpus
    path_to_file = 'cmn.txt'
    print('English preprocessing example')
    print('Before: ' + 'he is a "Editor-in-Chief".')
    print('After: ' + preprocess_sentence('he is a "Editor-in-Chief".'))
    print('Chinese preprocessing example')
    print('Before: ' + '人工智能程序员这种职业太*&￥%的厉害了！?Are you ok')
    print('After: ' + preprocess_sentence('人工智能程序员这种职业太*&￥%的厉害了！?Are you ok'))
    en, chs = create_dataset(path_to_file, num_examples)
    print('Sample of the processed text dataset:')
    print(en)
    print(chs)
    # For a quick demo, only process the first num_examples sentence pairs
    input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)
    # Maximum sequence length of the target and input tensors
    max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)
    # Split into training and validation sets with an 80/20 ratio
    input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)
    # Show the sizes of the four splits
    print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))
    print('Encoded source-language (Chinese) tensor dataset:')
    print(input_tensor)
    print('Word ids in the source-language (Chinese) vocabulary:')
    print(inp_lang.word_index)
    print('One source-language (Chinese) sentence, id by id:')
    convert(inp_lang, input_tensor_train[20])
    print('Encoded target-language (English) tensor dataset:')
    print(target_tensor)
    print('Word ids in the target-language (English) vocabulary:')
    print(targ_lang.word_index)
    print('One target-language (English) sentence, id by id:')
    convert(targ_lang, target_tensor_train[20])
    # Build a tf.data dataset. With 80 training examples and BATCH_SIZE = 64,
    # drop_remainder=True leaves one full batch per epoch and drops the other 16.
    BUFFER_SIZE = len(input_tensor_train)
    BATCH_SIZE = 64
    dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
    dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
    example_input_batch, example_target_batch = next(iter(dataset))
    print('Batch shapes:')
    print(example_input_batch.shape, example_target_batch.shape)
    print('One batch of input data:')
    print(example_input_batch)
    print('One batch of target data:')
    print(example_target_batch)
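
To make the `tokenize` step more concrete, here is a dependency-free sketch of what the Keras `Tokenizer` followed by `pad_sequences(padding='post')` does conceptually: build a vocabulary in which more frequent words get smaller ids, map each sentence to a list of ids, and right-pad with zeros up to the longest sentence. The function names `build_word_index`, `texts_to_ids`, and `pad_post` are made up for illustration. This is not the Keras implementation, and the ordering of equally frequent words may differ from what Keras produces.

```python
from collections import Counter


def build_word_index(sentences):
    """Map each word to an integer id; more frequent words get smaller ids.

    Id 0 is reserved for padding, so ids start at 1.
    """
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_ids(sentences, word_index):
    """Replace every known word by its id; unknown words are skipped."""
    return [[word_index[w] for w in s.lower().split() if w in word_index]
            for s in sentences]


def pad_post(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]


if __name__ == "__main__":
    corpus = ["<start> hi . <end>", "<start> i won ! <end>"]
    index = build_word_index(corpus)
    print(index)
    print(pad_post(texts_to_ids(corpus, index)))
```

Because padding uses id 0, the `convert` helper in the code above can skip padded positions simply by ignoring zeros.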