Résumé Project: A Spanish-to-English Machine Translation Model

Article link: https://blog.csdn.net/weixin_41858806/article/details/132286542

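Project name: Spanish-to-English machine translation model

1. Project Overview

This project builds a model that translates Spanish into English by training a neural network to learn patterns from paired sentences and generate accurate translations.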

1.1 Development Environment

TensorFlow + Keras + GRU + Sequence-to-Sequence + Attention Mechanism

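1.2 Personal Responsibilities

Data preprocessing, feature engineering, model construction and training, and hyperparameter tuning.

1.3 Technical Highlights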

1.3.1 Tokenizer

A Tokenizer splits the source-language and target-language sentences at the word level and maps each word to an integer id.
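To make word-level tokenization concrete, here is a minimal sketch in plain Python. It is not the Keras Tokenizer API; the helper names `build_vocab` and `encode` are hypothetical, and the conventions (ids assigned by descending frequency, id 0 reserved for padding, padding added on the right) are assumptions chosen for illustration.

```python
from collections import Counter

def build_vocab(sentences):
    """Map each word to an integer id; id 0 is reserved for padding (assumed convention)."""
    counts = Counter(word for s in sentences for word in s.split())
    # Most frequent words get the smallest ids; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered, start=1)}

def encode(sentences, vocab):
    """Convert sentences to id sequences and right-pad them with 0 to equal length."""
    ids = [[vocab[w] for w in s.split()] for s in sentences]
    longest = max(len(seq) for seq in ids)
    return [seq + [0] * (longest - len(seq)) for seq in ids]

# Two already-preprocessed sentences with start and end tokens.
sentences = ["<start> hola . <end>", "<start> ¿ como estas ? <end>"]
vocab = build_vocab(sentences)
padded = encode(sentences, vocab)
for row in padded:
    print(row)
```

After encoding, every sentence has the same length, so the batch can be stacked into a single integer tensor and fed to an Embedding layer.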

1.3.2 Data Modeling

An Embedding layer maps each token id to a dense, low-dimensional vector; the Encoder is built from a GRU, and the Decoder uses Bahdanau Attention.
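The arithmetic behind Bahdanau (additive) attention can be sketched in plain Python. For each encoder output h the score is v · tanh(W1·h + W2·s), where s is the current decoder state; a softmax over the scores gives the attention weights, and the context vector is the weighted sum of the encoder outputs. The function name and the tiny weight matrices below are illustrative assumptions; in a Keras model these projections would typically be trainable Dense layers.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def bahdanau_attention(decoder_state, encoder_outputs, W1, W2, v):
    """Return (context_vector, attention_weights) using additive attention."""
    projected_state = matvec(W2, decoder_state)
    scores = []
    for h in encoder_outputs:
        summed = [a + b for a, b in zip(matvec(W1, h), projected_state)]
        scores.append(sum(vi * math.tanh(x) for vi, x in zip(v, summed)))
    # Numerically stable softmax over the time steps.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder outputs.
    hidden = len(encoder_outputs[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_outputs))
               for i in range(hidden)]
    return context, weights

# Illustrative values: 3 time steps, hidden size 2, attention units 2.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, -0.5]
W1 = [[0.1, 0.2], [0.3, 0.4]]
W2 = [[0.5, 0.6], [0.7, 0.8]]
v = [1.0, -1.0]
context, weights = bahdanau_attention(decoder_state, encoder_outputs, W1, W2, v)
print("attention weights:", weights)
print("context vector:", context)
```

The decoder recomputes these weights at every output step, so each generated English word can focus on a different part of the Spanish sentence.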

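1.3.3 Optimizer and Loss Function

The model is trained with the Adam optimizer and a custom loss function; parameters are updated by gradient descent.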
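The text does not say what the custom loss computes. One common choice for padded sequences, assumed here rather than taken from the article, is a cross-entropy that ignores padding positions so that the padded tail of a short sentence does not influence the gradient. The sketch below shows that idea in plain Python; `masked_cross_entropy` and `pad_id` are hypothetical names.

```python
import math

def masked_cross_entropy(target_ids, predicted_probs, pad_id=0):
    """Average negative log-likelihood over the non-padding positions only."""
    losses = [
        -math.log(probs[target])
        for target, probs in zip(target_ids, predicted_probs)
        if target != pad_id  # skip padding (assumes pad_id marks padded steps)
    ]
    return sum(losses) / len(losses) if losses else 0.0

# One target sentence of length 3 whose last position is padding (id 0).
target_ids = [2, 1, 0]
predicted_probs = [
    [0.1, 0.2, 0.7],    # step 1: the correct id 2 gets probability 0.7
    [0.1, 0.8, 0.1],    # step 2: the correct id 1 gets probability 0.8
    [0.9, 0.05, 0.05],  # step 3: padding, ignored by the loss
]
loss = masked_cross_entropy(target_ids, predicted_probs)
print("masked loss:", loss)
```

In a TensorFlow training loop, the equivalent masked loss would be computed on tensors and its gradients passed to the Adam optimizer.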

2. Code Implementation

# 1. Basic project setup
# 1.1 Import machine-learning packages
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras

# 1.2 Import system packages
import re
import os
import sys
import time
import unicodedata

# 1.3 Suppress all warnings
import warnings
warnings.filterwarnings("ignore")

# 1.4 Mount Google Drive (works only inside Google Colab)
from google.colab import drive
drive.mount('/content/drive')
print("Files in the Drive data folder:")
!ls /content/drive/MyDrive/data

# 2. Inspect device information
# 2.1 Print the framework version
print("TensorFlow version:", tf.__version__)
# 2.2 Check whether a GPU device is available
gpu_devices = tf.config.list_physical_devices('GPU')
print("GPU available:", bool(gpu_devices))
# 2.3 List every GPU that TensorFlow can see
visible_devices = tf.config.list_physical_devices('GPU')
if visible_devices:
    for device in visible_devices:
        print("GPU detected and visible:", device)
else:
    print("No visible GPU device found")

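# 3. Special-character handling
# 3.1 Spanish uses accented characters. NFD normalization splits each one into a base letter
#     plus combining marks (category 'Mn'); dropping the marks leaves unaccented letters
#     and shrinks the vocabulary.
def unicode_to_ascii(sentence):
    return ''.join(c for c in unicodedata.normalize('NFD', sentence) if unicodedata.category(c) != 'Mn')

# 3.2 Test on sample sentences
en_sentence = u"May I borrow this book? "
sp_sentence = u"¿Puedo tomar prestado este libro?"
print("English sentence:", unicode_to_ascii(en_sentence))
print("Spanish sentence:", unicode_to_ascii(sp_sentence))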

# 4. Sentence preprocessing
def preprocess_sentence(w):
    # 4.1 Lowercase, trim surrounding whitespace, and strip accents
    w = unicode_to_ascii(w.lower().strip())
    # 4.2 Put a space on both sides of each punctuation mark so it becomes its own token
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    # 4.3 Collapse any run of whitespace into a single space
    w = re.sub(r"\s+", " ", w)
    # 4.4 Replace every character except a-z, A-Z, ".", "?", "!", "," and "¿" with a space
    #     to keep the vocabulary small
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    # 4.5 Remove leading and trailing spaces
    w = w.strip()
    # 4.6 Add the start and end tokens
    w = '<start> ' + w + ' <end>'
    return w

# 4.7 Test on the preprocessed sentences
print("English sentence:", preprocess_sentence(en_sentence))
print("Spanish sentence:", preprocess_sentence(sp_sentence))

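# 5. Load the data
# 5.1 Path to the dataset
data_path = '/content/drive/MyDrive/data/spa.txt'

# 5.2 Each line is a tab-separated pair; the English sentence is read first and the
#     Spanish sentence second, matching the unpacking order in 5.3
def create_dataset(path, num_examples):
    with open(path, encoding='UTF-8') as f:
        lines = f.read().strip().split('\n')
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')] for l in lines[:num_examples]]
    return zip(*word_pairs)

# 5.3 Take the first 30000 pairs: English sentences go into en, Spanish sentences into sp
en, sp = create_dataset(data_path, 30000)
print("Number of English samples:", len(en))
print("Number of Spanish samples:", len(sp))
print()
print("Last English sample:", en[-1])
print("Last Spanish sample:", sp[-1])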

# 6. Convert words to ids
# 6.1 Get the maximum sentence length
def max_length(tensor):
    return max(len(t) for t in tensor)
# 6.2 Convert words to ids and add padding
def tokenize(lang):
    <