[NLP] Machine Translation with a Keras Transformer

x66ccff

Article link: https://blog.csdn.net/qq_18846849/article/details/127491023

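Goal: following the referenced article, perform Chinese -> English machine translation using the small CMN Chinese-English dataset.
The model is built and trained with the third-party keras-transformer library.
Only the code that differs from the referenced article is shown below.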

1. Add the <SOS> and <EOS> markers

# Prepend <SOS> and append <EOS> to every English sentence
for i in range(len(string_en_ls)):
    string_en_ls[i] = '<SOS> ' + string_en_ls[i] + ' <EOS>'

string_en_ls[:10]

['<SOS> Hi. <EOS>',
 '<SOS> Hi. <EOS>',
 '<SOS> Run. <EOS>',
 '<SOS> Stop! <EOS>',
 '<SOS> Wait! <EOS>',
 '<SOS> Wait! <EOS>',
 '<SOS> Begin. <EOS>',
 '<SOS> Hello! <EOS>',
 '<SOS> I try. <EOS>',
 '<SOS> I won! <EOS>']

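2. Use the same token count for Chinese and English

Because of a limitation of the keras-transformer model, the maximum number of tokens is set to 7000 for both Chinese and English.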
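
To make the shared cap concrete, here is a minimal pure-Python sketch of capping two vocabularies at the same size. It is not the Keras Tokenizer used in the post; `build_vocab`, `encode`, and the toy sentences are hypothetical names and data introduced only for illustration, and id 0 is assumed to be reserved for padding.

```python
from collections import Counter

MAX_TOKENS = 7000  # shared cap for both languages, as in the text


def build_vocab(sentences, max_tokens=MAX_TOKENS):
    """Map the most frequent words to ids 1..max_tokens-1; id 0 is left for padding."""
    counts = Counter(word for s in sentences for word in s.split())
    most_common = counts.most_common(max_tokens - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}


def encode(sentence, vocab):
    """Convert a sentence to ids, dropping out-of-vocabulary words."""
    return [vocab[w] for w in sentence.split() if w in vocab]


# Toy data (hypothetical); Chinese is split per character with spaces.
zh_vocab = build_vocab(["你 好 ！", "你 好 吗 ？"])
en_vocab = build_vocab(["<sos> hi . <eos>", "<sos> how are you ? <eos>"])

# Every id from either vocabulary must fit below the single shared cap.
assert max(zh_vocab.values()) < MAX_TOKENS
assert max(en_vocab.values()) < MAX_TOKENS
```

Because both vocabularies stay below the same cap, a single `token_num=7000` can cover the ids on the encoder side and the decoder side.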

3. Add the <PAD> marker to tokenizer_en

# Add a pad entry to the tokenizer
tokenizer_en.word_index['<pad>'] = 0
tokenizer_en.index_word[0] = '<pad>'

# Inspect the English vocabulary, ordered by word frequency
print(len(tokenizer_en.word_index))
tokenizer_en.word_index

{'<sos>': 1,
 '<eos>': 2,
 'the': 3,
 'i': 4,
 'to': 5,
 'you': 6,
 ...
 'york': 997,
 'color': 998,
 'useful': 999,
 'supposed': 1000,
 ...}

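4. Build the decoder_output array

This array is the target output of the Transformer model; the original y becomes the model's second input (the decoder input).

# Shift every element of data_en_ls_20 one position to the left to build the decoder output, padding the end with 0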

data_en_ls_20_shift_1 = []
for i in range(len(data_en_ls_20)):
    data_en_ls_20_shift_1.append(data_en_ls_20[i][1:] + [0])

# Convert data_en_ls_20_shift_1 to a NumPy array
data_en_mat_decoder = pad_sequences(data_en_ls_20_shift_1, maxlen=20, padding='post')

data_en_mat_decoder
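
The left shift is easier to see on a single toy sequence. The sketch below is pure Python with hypothetical ids (1 = <sos>, 2 = <eos>, 0 = <pad>); `shift_left` is an illustrative helper, not a function from the post or from keras-transformer.

```python
def shift_left(seq, pad_id=0):
    """Drop the first token and append one pad id, keeping the length unchanged."""
    return seq[1:] + [pad_id]


# Hypothetical ids: 1 = <sos>, 2 = <eos>, 0 = <pad>
decoder_input = [1, 90, 58, 2, 0, 0]
decoder_output = shift_left(decoder_input)

# At step t the decoder has seen decoder_input[:t + 1] and should predict decoder_output[t].
for t, (seen, target) in enumerate(zip(decoder_input, decoder_output)):
    print(t, seen, "->", target)
```

The input keeps its leading <sos>, while the target starts at the first real word and ends with <eos>, so each position is trained to predict the next token.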

5. Split into training and test sets

# Split into training and test sets
from sklearn.model_selection import train_test_split
(data_en_train, data_en_test,
 data_zh_train, data_zh_test,
 data_en_mat_decoder_train, data_en_mat_decoder_test) = train_test_split(
    data_en_mat, data_zh_mat, data_en_mat_decoder,
    test_size=0.3, random_state=42, shuffle=True)
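
The three arrays must stay aligned row by row after shuffling, and the results come back as one train/test pair per input, in input order. The following pure-Python sketch illustrates that idea with one shared permutation. It is not scikit-learn's implementation (its rounding of the test size may differ), and `aligned_split` plus the toy lists are hypothetical.

```python
import random


def aligned_split(*arrays, test_size=0.3, seed=42):
    """Split several equally long lists using one shared shuffle."""
    n = len(arrays[0])
    assert all(len(a) == n for a in arrays), "all inputs must have the same length"
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    result = []
    for a in arrays:
        # One (train, test) pair per input, in the order the inputs were given.
        result.append([a[i] for i in train_idx])
        result.append([a[i] for i in test_idx])
    return result


en = [f"en{i}" for i in range(10)]
zh = [f"zh{i}" for i in range(10)]
dec = [f"dec{i}" for i in range(10)]
en_train, en_test, zh_train, zh_test, dec_train, dec_test = aligned_split(en, zh, dec)
```

Row i of each training list still refers to the same sentence pair, which is what the model needs when it receives two inputs and one target.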

6. Build and train the model

from keras_transformer import get_model

# Build the model
model = get_model(
    token_num=7000,
    embed_dim=128,
    encoder_num=3,
    decoder_num=2,
    head_num=4,                    # embed_dim must be divisible by head_num
    hidden_dim=256,                # no constraint on hidden_dim
    attention_activation='relu',
    feed_forward_activation='relu',
    dropout_rate=0.1,
)
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['sparse_categorical_accuracy']
)
model.summary()
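
The divisibility note on head_num follows from how multi-head attention splits the embedding evenly across heads. A minimal check, assuming that standard split; `head_dim` is a hypothetical helper and not part of keras-transformer:

```python
def head_dim(embed_dim, head_num):
    """Return the per-head width, or raise if the embedding cannot be split evenly."""
    if embed_dim % head_num != 0:
        raise ValueError(
            f"embed_dim={embed_dim} is not divisible by head_num={head_num}"
        )
    return embed_dim // head_num


print(head_dim(128, 4))  # 32
```

With the values used above, each of the 4 heads works on a 32-dimensional slice, whereas a choice such as head_num=3 would fail this check.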


# Reshape the English label data, because the model output is 3-D while the labels are 2-D
# data_en_train_reshape = data_en_train.reshape(data_en_train.shape[0], data_en_train.shape[1], 1)
# data_en_test_reshape = data_en_test.reshape(data_en_test.shape[0], data_en_test.shape[1], 1)

data_en_mat_decoder_train_reshape = data_en_mat_decoder_train.reshape(data_en_mat_decoder_train.shape[0], data_en_mat_decoder_train.shape[1], 1)
data_en_mat_decoder_test_reshape = data_en_mat_decoder_test.reshape(data_en_mat_decoder_test.shape[0], data_en_mat_decoder_test.shape[1], 1)

# Train for 30 epochs, using the _train arrays for training and the _test arrays for validation
# Chinese -> English translation, so zh is the input and en is the output
model.fit(
    [data_zh_train, data_en_train],
    data_en_mat_decoder_train_reshape,
    epochs=30,
    batch_size=128,
    validation_data=([data_zh_test, data_en_test], data_en_mat_decoder_test_reshape),
)

7. Test the model

Use the decode function from the keras-transformer library to decode the model's output.

test_string = '你好！'

# Insert a space between every character of test_string
test_string = ' '.join(test_string)
# (Disabled) add the start and end markers
# test_string = '<SOS> ' + test_string
# Convert test_string to tokens
test_string_token = tokenizer_zh.texts_to_sequences([test_string])
# Keep the first 10 characters
test_string_token = test_string_token[0][:10]
# Convert to a NumPy array and pad
test_string_mat = pad_sequences([test_string_token], maxlen=10, padding='post', truncating='post')
test_string_mat

array([[ 5, 29, 69, 0, 0, 0, 0, 0, 0, 0]])

from keras_transformer import decode
decoded = decode(
    model,
    test_string_mat.tolist(),
    start_token=tokenizer_en.word_index['<sos>'],
    end_token=tokenizer_en.word_index['<eos>'],
    pad_token=tokenizer_en.word_index['<pad>'],
    max_len=100,
    top_k=4,                                # top_k and temperature add randomness to decoding
    temperature=1.0,
)
decoded

1/1 [============================== ] - 0s 41ms/step
1/1 [============================== ] - 0s 48ms/step
1/1 [============================== ] - 0s 38ms/step
1/1 [============================== ] - 0s 38ms/step
[[1, 90, 58, 58, 2]]
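
As a rough illustration of what top_k and temperature control, here is a pure-Python sketch of generic top-k sampling with temperature. It shows the general technique only and is not taken from keras-transformer's decode; `sample_top_k` and the toy distribution are hypothetical.

```python
import math
import random


def sample_top_k(probs, top_k=4, temperature=1.0, rng=random):
    """Sample a token id from the top_k most probable entries of probs."""
    # Keep the ids of the top_k largest probabilities.
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Temperature rescales log-probabilities: below 1 sharpens, above 1 flattens.
    logits = [math.log(probs[i] + 1e-12) / temperature for i in top_ids]
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    return rng.choices(top_ids, weights=weights, k=1)[0]


probs = [0.05, 0.6, 0.2, 0.1, 0.05]  # hypothetical next-token distribution
print(sample_top_k(probs, top_k=1))  # 1 (top_k=1 reduces to greedy argmax)
print(sample_top_k(probs, top_k=4, rng=random.Random(0)))  # one of the 4 most likely ids
```

With top_k=1 the choice is deterministic; larger top_k or higher temperature lets less likely tokens through, which is why repeated runs of the same input can produce different translations.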

token_dict_rev = {v: k for k, v in tokenizer_en.word_index.items()}
for i in range(len(decoded)):
    print(' '.join(map(lambda x: token_dict_rev[x], decoded[i][1:-1])))

you’re good good

The translation comes out rather Chinglish.
