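Surname Classification with an MLP Model
1. Experiment Overview
This experiment uses an MLP (multilayer perceptron) to predict nationality from a surname, framed as a classification task.
2. Multilayer Perceptron
Before turning to the main experiment, this section briefly introduces the model it uses: the multilayer perceptron.
2.1 Perceptron
A perceptron consists of two layers of neurons, as shown in Figure 1.
- It has only an input layer and an output layer, with no hidden layer.
- Because of this simple structure its decision boundary is linear, so it can only solve linearly separable problems, that is, problems in which a single hyperplane separates the classes. For data that are not linearly separable, such as the XOR problem, a single-layer perceptron cannot find a hyperplane that classifies every point correctly.
- The output of a perceptron can be written as: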
$$y = f\left(\sum_{i=1}^{n} \omega_{i} x_{i} - \theta\right)$$
where $f(\cdot)$ is the activation function.
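As a small illustration of the formula above, the sketch below evaluates a perceptron with a unit-step activation on the four binary input pairs. The weights and thresholds are hand-picked for illustration and are not taken from the experiment. AND and OR are linearly separable, so a single threshold works for each; no choice of weights and threshold reproduces XOR.

```python
def perceptron(x, w, theta):
    """Return f(sum_i w_i * x_i - theta), with f a unit-step activation."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return 1 if s >= 0 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hand-picked (hypothetical) parameters: same weights, different thresholds.
and_outputs = [perceptron(x, w=(1.0, 1.0), theta=1.5) for x in inputs]
or_outputs = [perceptron(x, w=(1.0, 1.0), theta=0.5) for x in inputs]

print(and_outputs)  # [0, 0, 0, 1]
print(or_outputs)   # [0, 1, 1, 1]
```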
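2.2 Multilayer Perceptron
The multilayer perceptron (MLP) extends the perceptron. It usually consists of an input layer, an output layer, and one or more hidden layers, as shown in Figure 2.
Its basic principles are:
- Inputs and weights: the input to each layer is the output of the previous layer, and the input to the first layer is the raw data. Every connection between a pair of neurons carries a weight. These weights are the parameters that training adjusts through backpropagation.
- Nonlinear activation: each neuron produces its output through a nonlinear activation function (such as ReLU, sigmoid, or tanh). This is what lets the network apply nonlinear transformations to its input.
- Layer-by-layer processing: each layer maps the data into a new feature space (not necessarily a higher-dimensional one), and the composition of these mappings lets the network capture complex patterns in the input.
- The output of an MLP can be written as: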
$$y = f_{L}\left(W_{L} \cdot f_{L-1}\left(W_{L-1} \cdots f_{1}\left(W_{1} \cdot x + b_{1}\right) \cdots + b_{L-1}\right) + b_{L}\right)$$
where $f_{i}$ is the activation function of layer $i$, $W_{i}$ is the weight matrix of layer $i$, and $b_{i}$ is the bias vector of layer $i$.
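To make the nested formula concrete, here is a minimal pure-Python sketch of a two-layer forward pass, with $f_1$ = ReLU and $f_2$ taken as the identity. The weights are hand-set (an illustrative assumption, not learned values) so that the network computes XOR, which a single perceptron cannot do.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def affine(W, x, b):
    """Compute W . x + b for a weight matrix given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# Hand-set (hypothetical) parameters for a 2-2-1 network.
W1, b1 = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]
W2, b2 = [[1.0, -2.0]], [0.0]

def mlp(x):
    hidden = relu(affine(W1, x, b1))   # f_1(W_1 x + b_1)
    return affine(W2, hidden, b2)[0]   # f_2 is the identity here

xor_outputs = [mlp(x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)  # [0.0, 1.0, 1.0, 0.0]
```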
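2.3 Training Procedure for an MLP
- Data preprocessing: standardize or normalize the input data to remove the effect of differing scales and speed up convergence, and split the dataset into training, validation, and test sets.
- Model initialization: define the model structure and initialize its parameters.
- Loss function: choose a loss function that measures the difference between predictions and ground truth. Classification tasks commonly use cross-entropy loss; regression tasks commonly use mean squared error.
- Optimizer: the optimizer iteratively adjusts the model parameters to minimize (or maximize) the objective, improving performance on the training data in the hope that the model generalizes to unseen data. Common optimizers include SGD (stochastic gradient descent) and Adam.
- Training: each iteration generally has four stages (forward pass, loss computation, backward pass, parameter update) and training continues until a preset number of epochs is reached or an early-stopping condition is met. In each iteration the model receives a batch of inputs and labels, passes the inputs through the linear transformations and nonlinear activations of each layer, and produces predictions. The predictions are compared with the true labels to compute the loss. The gradient of the loss with respect to the model parameters is then computed, and the parameters are updated according to the optimizer's rule so that the loss keeps decreasing.
- Validation and tuning: during or after training, evaluate the model on the validation set to monitor overfitting or underfitting, and adjust the model structure or hyperparameters based on the results.
- Testing: evaluate the model on the test set to assess how well it generalizes.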
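The four training stages listed above can be sketched without any framework. The example below fits a one-parameter model y = w·x with mean squared error and plain gradient descent; the data, learning rate, and epoch count are illustrative assumptions, and the gradient is written out by hand where a framework would use backpropagation.

```python
# Toy data generated from y = 2x (hypothetical, for illustration only).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0      # model parameter, initialized to zero
lr = 0.05    # learning rate
losses = []

for epoch in range(100):
    # 1. Forward pass: compute predictions.
    preds = [w * x for x in xs]
    # 2. Loss: mean squared error.
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # 3. Backward pass: analytic gradient d(loss)/dw.
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # 4. Parameter update: one gradient-descent step.
    w -= lr * grad
    losses.append(loss)

print(round(w, 4))  # 2.0
```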
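2.4 A Simple MLP Example
- Problem description: as Figure 3 shows, two kinds of data points, circles and stars, are scattered over the plane. We train an MLP to separate them.
- Model definition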
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultilayerPerceptron(nn.Module):
    """An MLP with a configurable number of hidden layers."""
    def __init__(self, input_size, hidden_size=2, output_size=3,
                 num_hidden_layers=1, hidden_activation=nn.Sigmoid):
        """Initialize weights.
        Args:
            input_size (int): size of the input vectors
            hidden_size (int): size of each hidden layer
            output_size (int): size of the output vectors
            num_hidden_layers (int): number of hidden layers
            hidden_activation (torch.nn.*): the activation class for hidden layers
        """
        super(MultilayerPerceptron, self).__init__()
        # Module list that stores the hidden layers and their activations
        self.module_list = nn.ModuleList()
        # Input and output sizes of the next layer to build
        interim_input_size = input_size
        interim_output_size = hidden_size
        # Build the hidden layers
        for _ in range(num_hidden_layers):
            self.module_list.append(nn.Linear(interim_input_size, interim_output_size))
            self.module_list.append(hidden_activation())
            # The next layer takes this layer's output as its input
            interim_input_size = interim_output_size
        self.fc_final = nn.Linear(interim_input_size, output_size)
        self.last_forward_cache = []

    def forward(self, x, apply_softmax=False):
        """The forward pass of the MLP
        Args:
            x (torch.Tensor): an input data tensor.
                x.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim)
        """
        # Reset the cache, then store a copy of the input
        self.last_forward_cache = []
        self.last_forward_cache.append(x.detach().cpu().numpy())
        # Pass through every module in the list, caching each output
        for module in self.module_list:
            x = module(x)
            self.last_forward_cache.append(x.detach().cpu().numpy())
        # Final linear layer
        output = self.fc_final(x)
        self.last_forward_cache.append(output.detach().cpu().numpy())
        # Apply softmax only on request
        if apply_softmax:
            output = F.softmax(output, dim=1)
        return output
- Instantiate an MLP with one hidden layer
import numpy as np
import matplotlib.pyplot as plt

# LABELS, get_toy_data and visualize_results are helpers whose definitions
# are not shown in this post.
# Model parameters
input_size = 2
output_size = len(set(LABELS))
num_hidden_layers = 1  # one hidden layer
hidden_size = 2
seed = 2
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
# Instantiate the two-layer MLP
mlp2 = MultilayerPerceptron(input_size=input_size,
                            hidden_size=hidden_size,
                            num_hidden_layers=num_hidden_layers,
                            output_size=output_size)
# Print the model structure
print(mlp2)
batch_size = 1000
x_data_static, y_truth_static = get_toy_data(batch_size)
fig, ax = plt.subplots(1, 1, figsize=(10,5))
visualize_results(mlp2, x_data_static, y_truth_static,
                  ax=ax, title='Initial 2-Layer MLP State', levels=[0.5])
plt.axis('off')
plt.savefig('images/mlp2_initial.png')
- Set the training parameters and the stopping condition, then train
import torch.optim as optim

losses = []          # loss values recorded during training
batch_size = 10000   # batch size
n_batches = 10       # batches per epoch
max_epochs = 15      # epoch limit
# Training state
loss_change = 1.0
last_loss = 10.0
change_threshold = 1e-5
epoch = 0
all_imagefiles = []
lr = 0.01  # learning rate
# Adam optimizer
optimizer = optim.Adam(params=mlp2.parameters(), lr=lr)
# Cross-entropy loss
cross_ent_loss = nn.CrossEntropyLoss()

# Decide whether training should stop
def early_termination(loss_change, change_threshold, epoch, max_epochs):
    terminate_for_loss_change = loss_change < change_threshold
    terminate_for_epochs = epoch > max_epochs
    # The loss-change criterion is computed but not used: only the epoch
    # limit stops training. To enable both, use
    # return terminate_for_loss_change or terminate_for_epochs
    return terminate_for_epochs

# Train until the stopping condition is met
while not early_termination(loss_change, change_threshold, epoch, max_epochs):
    for _ in range(n_batches):
        # Fetch a batch of training data
        x_data, y_target = get_toy_data(batch_size)
        # Zero the gradients
        mlp2.zero_grad()
        # Forward pass
        y_pred = mlp2(x_data).squeeze()
        # Compute the loss
        loss = cross_ent_loss(y_pred, y_target.long())
        # Backward pass
        loss.backward()
        # Update the parameters
        optimizer.step()
        # Record training information
        loss_value = loss.item()
        losses.append(loss_value)
        loss_change = abs(last_loss - loss_value)
        last_loss = loss_value
    # Visualize the predictions after each epoch
    fig, ax = plt.subplots(1, 1, figsize=(10,5))
    visualize_results(mlp2, x_data_static, y_truth_static, ax=ax, epoch=epoch,
                      title=f"{loss_value:0.2f}; {loss_change:0.4f}")
    plt.axis('off')
    epoch += 1
    all_imagefiles.append(f'images/mlp2_epoch{epoch}_toylearning.png')
    plt.savefig(all_imagefiles[-1])
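The steps above define, instantiate, and train a two-layer MLP.
Figure 4 shows the model's classification result once training stops at the epoch limit (max_epochs = 15).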
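3. Experiment Steps
With the MLP background covered, we turn to the main task: classifying surnames with an MLP.
3.1 The Surname Dataset
- Dataset: the dataset contains 10,000 surnames from 18 nationalities, collected from various name sources on the internet, and reflects the diversity of surnames across countries. Two properties stand out. First, the classes are heavily imbalanced: English, Russian, and Arabic surnames make up most of the data (62% combined), led by English (27%) and Russian (21%). This imbalance can make training harder, because the model may become biased toward the frequent classes. Second, the spelling of a surname is clearly related to its nationality: certain spelling patterns point directly to a country of origin, which gives the model a usable signal.
- Preprocessing: to reduce the imbalance, the over-represented class (Russian surnames) was downsampled by randomly selecting a subset of it. A more even class distribution helps the model learn features of every class and improves overall accuracy. The data were then split into training (70%), validation (15%), and test (15%) sets, keeping the distribution of nationality labels similar across the three parts, so that training, tuning, and final evaluation all run on representative data. This kind of split makes it possible to assess how well the model generalizes to unseen data.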
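A split that keeps the label distribution similar across parts is usually called a stratified split. The sketch below shows one way to do it in plain Python; the function name `stratified_split`, the seed, and the toy rows are assumptions for illustration, and this is not necessarily how the `split` column of the actual CSV was produced.

```python
import random
from collections import defaultdict

def stratified_split(rows, train=0.70, val=0.15, seed=1337):
    """Assign 'train'/'val'/'test' per nationality so each part keeps similar label proportions."""
    by_label = defaultdict(list)
    for surname, nationality in rows:
        by_label[nationality].append(surname)
    rng = random.Random(seed)
    result = []
    for nationality, surnames in sorted(by_label.items()):
        rng.shuffle(surnames)
        n_train = round(train * len(surnames))
        n_val = round(val * len(surnames))
        for i, surname in enumerate(surnames):
            if i < n_train:
                split = "train"
            elif i < n_train + n_val:
                split = "val"
            else:
                split = "test"
            result.append((surname, nationality, split))
    return result

# Hypothetical toy rows: 20 surnames for each of two nationalities.
rows = [(f"name{i}", "English") for i in range(20)]
rows += [(f"imya{i}", "Russian") for i in range(20)]
split_rows = stratified_split(rows)
```

With 20 rows per class, each class contributes 14 training, 3 validation, and 3 test rows.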
import json

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class SurnameDataset(Dataset):
    def __init__(self, surname_df, vectorizer):
        """
        Args:
            surname_df (pandas.DataFrame): the dataset
            vectorizer (SurnameVectorizer): vectorizer instantiated from dataset
        """
        self.surname_df = surname_df
        self._vectorizer = vectorizer
        # Training, validation and test partitions
        self.train_df = self.surname_df[self.surname_df.split=='train']
        self.train_size = len(self.train_df)
        self.val_df = self.surname_df[self.surname_df.split=='val']
        self.validation_size = len(self.val_df)
        self.test_df = self.surname_df[self.surname_df.split=='test']
        self.test_size = len(self.test_df)
        # Lookup dictionary for switching quickly between splits
        self._lookup_dict = {'train': (self.train_df, self.train_size),
                             'val': (self.val_df, self.validation_size),
                             'test': (self.test_df, self.test_size)}
        self.set_split('train')
        # Class weights: inverse class frequency, ordered by class index,
        # used to counter the class imbalance
        class_counts = surname_df.nationality.value_counts().to_dict()
        def sort_key(item):
            return self._vectorizer.nationality_vocab.lookup_token(item[0])
        sorted_counts = sorted(class_counts.items(), key=sort_key)
        frequencies = [count for _, count in sorted_counts]
        self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)

    # Load the dataset and build a new vectorizer
    @classmethod
    def load_dataset_and_make_vectorizer(cls, surname_csv):
        """Load dataset and make a new vectorizer from scratch
        Args:
            surname_csv (str): location of the dataset
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        train_surname_df = surname_df[surname_df.split=='train']
        return cls(surname_df, SurnameVectorizer.from_dataframe(train_surname_df))

    # Load the dataset together with a cached vectorizer
    @classmethod
    def load_dataset_and_load_vectorizer(cls, surname_csv, vectorizer_filepath):
        """Load dataset and the corresponding vectorizer.
        Used when the vectorizer has been cached for re-use
        Args:
            surname_csv (str): location of the dataset
            vectorizer_filepath (str): location of the saved vectorizer
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(surname_df, vectorizer)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """a static method for loading the vectorizer from file
        Args:
            vectorizer_filepath (str): the location of the serialized vectorizer
        Returns:
            an instance of SurnameVectorizer
        """
        with open(vectorizer_filepath) as fp:
            return SurnameVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        """saves the vectorizer to disk using json
        Args:
            vectorizer_filepath (str): the location to save the vectorizer
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)

    def get_vectorizer(self):
        """ returns the vectorizer """
        return self._vectorizer

    def set_split(self, split="train"):
        """ selects the splits in the dataset using a column in the dataframe """
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    # Fetch one sample by index
    def __getitem__(self, index):
        """the primary entry point method for PyTorch datasets
        Args:
            index (int): the index to the data point
        Returns:
            a dictionary holding the data point's:
                features (x_surname)
                label (y_nationality)
        """
        row = self._target_df.iloc[index]
        surname_vector = \
            self._vectorizer.vectorize(row.surname)
        nationality_index = \
            self._vectorizer.nationality_vocab.lookup_token(row.nationality)
        return {'x_surname': surname_vector,
                'y_nationality': nationality_index}

    # Number of full batches in the current split
    def get_num_batches(self, batch_size):
        """Given a batch size, return the number of batches in the dataset
        Args:
            batch_size (int)
        Returns:
            number of batches in the dataset
        """
        return len(self) // batch_size

# Generator that yields batches for the given batch size and shuffle setting
def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"):
    """
    A generator function which wraps the PyTorch DataLoader. It will
    ensure each tensor is on the right device location.
    """
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)
    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = tensor.to(device)
        yield out_data_dict
3.2 Vocabulary, Vectorizer, and DataLoader
3.2.1 The Vocabulary
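The Vocabulary class processes text and builds the vocabulary: it maps tokens (characters or words) to indices and also supports the reverse mapping from indices back to tokens.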
class Vocabulary(object):
    """Class to process text and extract vocabulary for mapping"""
    def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
        """
        Args:
            token_to_idx (dict): a pre-existing map of tokens to indices
            add_unk (bool): a flag that indicates whether to add the UNK token
            unk_token (str): the UNK token to add into the Vocabulary
        """
        # Start from an empty mapping if none is provided
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx
        # Reverse mapping from index to token
        self._idx_to_token = {idx: token
                              for token, idx in self._token_to_idx.items()}
        # Whether to add the UNK token, and the token itself
        self._add_unk = add_unk
        self._unk_token = unk_token
        # -1 means "no UNK token"
        self.unk_index = -1
        # Add the UNK token to the vocabulary
        if add_unk:
            self.unk_index = self.add_token(unk_token)

    # Produce a dictionary that can be serialized
    def to_serializable(self):
        """ returns a dictionary that can be serialized """
        return {'token_to_idx': self._token_to_idx,
                'add_unk': self._add_unk,
                'unk_token': self._unk_token}

    # Rebuild the object from saved contents
    @classmethod
    def from_serializable(cls, contents):
        """ instantiates the Vocabulary from a serialized dictionary """
        return cls(**contents)

    # Update the mappings for a token and return its index
    def add_token(self, token):
        """Update mapping dicts based on the token.
        Args:
            token (str): the item to add into the Vocabulary
        Returns:
            index (int): the integer corresponding to the token
        """
        try:
            # The token already exists: return its index
            index = self._token_to_idx[token]
        except KeyError:
            # New token: assign the next index and add it to both mappings
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index

    # Add a list of tokens to the vocabulary
    def add_many(self, tokens):
        """Add a list of tokens into the Vocabulary
        Args:
            tokens (list): a list of string tokens
        Returns:
            indices (list): a list of indices corresponding to the tokens
        """
        return [self.add_token(token) for token in tokens]

    # Look up the index of a token; fall back to the UNK index if it is missing
    def lookup_token(self, token):
        """Retrieve the index associated with the token
        or the UNK index if token isn't present.
        Args:
            token (str): the token to look up
        Returns:
            index (int): the index corresponding to the token
        Notes:
            `unk_index` needs to be >=0 (having been added into the Vocabulary)
            for the UNK functionality
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]

    # Look up the token for an index
    def lookup_index(self, index):
        """Return the token associated with the index
        Args:
            index (int): the index to look up
        Returns:
            token (str): the token corresponding to the index
        Raises:
            KeyError: if the index is not in the Vocabulary
        """
        if index not in self._idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self._idx_to_token[index]

    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)

    # Number of tokens in the vocabulary
    def __len__(self):
        return len(self._token_to_idx)
3.2.2 The SurnameVectorizer
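The SurnameVectorizer class builds the vocabularies that map characters and nationalities to unique integer ids, converts a surname into the input format the model expects (a collapsed one-hot encoding), and supports serialization and deserialization.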
import numpy as np

class SurnameVectorizer(object):
    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""
    def __init__(self, surname_vocab, nationality_vocab):
        """
        Args:
            surname_vocab (Vocabulary): maps characters to integers
            nationality_vocab (Vocabulary): maps nationalities to integers
        """
        self.surname_vocab = surname_vocab
        self.nationality_vocab = nationality_vocab

    # Vectorize a single surname
    def vectorize(self, surname):
        """
        Args:
            surname (str): the surname
        Returns:
            one_hot (np.ndarray): a collapsed one-hot encoding
        """
        vocab = self.surname_vocab
        one_hot = np.zeros(len(vocab), dtype=np.float32)
        for token in surname:
            one_hot[vocab.lookup_token(token)] = 1
        return one_hot

    # Build the vectorizer from the dataset DataFrame
    @classmethod
    def from_dataframe(cls, surname_df):
        """Instantiate the vectorizer from the dataset dataframe
        Args:
            surname_df (pandas.DataFrame): the surnames dataset
        Returns:
            an instance of the SurnameVectorizer
        """
        surname_vocab = Vocabulary(unk_token="@")      # "@" stands for unknown characters
        nationality_vocab = Vocabulary(add_unk=False)  # no UNK: all nationalities are assumed known
        for index, row in surname_df.iterrows():
            for letter in row.surname:
                surname_vocab.add_token(letter)            # add every character
            nationality_vocab.add_token(row.nationality)   # add the nationality
        return cls(surname_vocab, nationality_vocab)

    # Restore a vectorizer from serialized contents
    @classmethod
    def from_serializable(cls, contents):
        surname_vocab = Vocabulary.from_serializable(contents['surname_vocab'])
        nationality_vocab = Vocabulary.from_serializable(contents['nationality_vocab'])
        return cls(surname_vocab=surname_vocab, nationality_vocab=nationality_vocab)

    # Convert the vectorizer into a serializable dictionary
    def to_serializable(self):
        return {'surname_vocab': self.surname_vocab.to_serializable(),
                'nationality_vocab': self.nationality_vocab.to_serializable()}
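To show what a collapsed one-hot encoding keeps and what it discards, the following standalone sketch mimics the vectorizer's behavior with plain dictionaries and lists. The helper names `build_char_index` and `collapsed_one_hot` and the two sample surnames are illustrative assumptions, not part of the experiment code. The encoding records only which characters occur, so character order and repetition are lost, and unseen characters all map to the unknown slot.

```python
def build_char_index(surnames, unk_token="@"):
    """Map each character to an index; index 0 is reserved for unknown characters."""
    char_to_idx = {unk_token: 0}
    for surname in surnames:
        for ch in surname:
            char_to_idx.setdefault(ch, len(char_to_idx))
    return char_to_idx

def collapsed_one_hot(surname, char_to_idx, unk_token="@"):
    """Set position i to 1.0 if character i occurs anywhere in the surname."""
    vec = [0.0] * len(char_to_idx)
    for ch in surname:
        vec[char_to_idx.get(ch, char_to_idx[unk_token])] = 1.0
    return vec

char_to_idx = build_char_index(["Lee", "Abe"])   # hypothetical training surnames
print(char_to_idx)                               # {'@': 0, 'L': 1, 'e': 2, 'A': 3, 'b': 4}
print(collapsed_one_hot("Lee", char_to_idx))     # [0.0, 1.0, 1.0, 0.0, 0.0]
print(collapsed_one_hot("eLe", char_to_idx))     # same vector: order is lost
print(collapsed_one_hot("Zed", char_to_idx))     # [1.0, 0.0, 1.0, 0.0, 0.0]
```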
3.3 Classifier Model
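The classifier is a two-layer MLP.
The first linear layer maps the input vector to an intermediate vector, to which a nonlinearity (ReLU) is applied.
The second linear layer maps the intermediate vector to the prediction vector.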
import torch.nn as nn
import torch.nn.functional as F

# Two-layer MLP
class SurnameClassifier(nn.Module):
    """ A 2-layer Multilayer Perceptron for classifying surnames """
    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Args:
            input_dim (int): the size of the input vectors
            hidden_dim (int): the output size of the first Linear layer
            output_dim (int): the output size of the second Linear layer
        """
        super(SurnameClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)   # first fully connected layer
        self.fc2 = nn.Linear(hidden_dim, output_dim)  # second fully connected layer

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the classifier
        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim)
        """
        intermediate_vector = F.relu(self.fc1(x_in))
        prediction_vector = self.fc2(intermediate_vector)
        # Apply softmax only on request
        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)
        return prediction_vector
3.4 Training Routine
The core training steps are the same as in the simple example of Section 2.4; only the task-specific data preparation and training setup differ.
3.4.1 Helper functions, settings and some prep work.
import os
from argparse import Namespace

import numpy as np
import torch

# Initialize the training-state dictionary
def make_train_state(args):
    return {'stop_early': False,  # early-stopping flag
            'early_stopping_step': 0,
            'early_stopping_best_val': 1e8,
            'learning_rate': args.learning_rate,
            'epoch_index': 0,
            'train_loss': [],
            'train_acc': [],
            'val_loss': [],
            'val_acc': [],
            'test_loss': -1,
            'test_acc': -1,
            'model_filename': args.model_state_file}

# Update the training state during training
def update_train_state(args, model, train_state):
    """Handle the training state updates.
    Components:
     - Early Stopping: Prevent overfitting.
     - Model Checkpoint: Model is saved if the model is better
    :param args: main arguments
    :param model: model to train
    :param train_state: a dictionary representing the training state values
    :returns:
        a new train_state
    """
    # Save one model at least
    if train_state['epoch_index'] == 0:
        torch.save(model.state_dict(), train_state['model_filename'])
        train_state['stop_early'] = False
    # Save model if performance improved
    elif train_state['epoch_index'] >= 1:
        loss_tm1, loss_t = train_state['val_loss'][-2:]
        # The loss did not improve on the best value so far
        if loss_t >= train_state['early_stopping_best_val']:
            train_state['early_stopping_step'] += 1
        # The loss improved
        else:
            # Save the best model and record its loss. Without this update
            # the threshold stays at 1e8 and early stopping can never trigger.
            torch.save(model.state_dict(), train_state['model_filename'])
            train_state['early_stopping_best_val'] = loss_t
            # Reset early stopping step
            train_state['early_stopping_step'] = 0
        # Stop early?
        train_state['stop_early'] = \
            train_state['early_stopping_step'] >= args.early_stopping_criteria
    return train_state

# Accuracy in percent
def compute_accuracy(y_pred, y_target):
    _, y_pred_indices = y_pred.max(dim=1)
    n_correct = torch.eq(y_pred_indices, y_target).sum().item()
    return n_correct / len(y_pred_indices) * 100

# Random seeds
def set_seed_everywhere(seed, cuda):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)

# Make sure the target directory exists
def handle_dirs(dirpath):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)

args = Namespace(
    # Data and path information
    surname_csv="data/surnames/surnames_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch4/surname_mlp",
    # Model hyper parameters
    hidden_dim=300,
    # Training hyper parameters
    seed=1337,
    num_epochs=100,
    early_stopping_criteria=5,
    learning_rate=0.001,
    batch_size=64,
    # Runtime options
    cuda=False,
    reload_from_files=False,
    expand_filepaths_to_save_dir=True,
)
# Prefix the file paths with the save directory if requested
if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir,
                                        args.vectorizer_file)
    args.model_state_file = os.path.join(args.save_dir,
                                         args.model_state_file)
    print("Expanded filepaths: ")
    print("\t{}".format(args.vectorizer_file))
    print("\t{}".format(args.model_state_file))
# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False
args.device = torch.device("cuda" if args.cuda else "cpu")
print("Using CUDA: {}".format(args.cuda))
# Set seed for reproducibility
set_seed_everywhere(args.seed, args.cuda)
# Create the save directory
handle_dirs(args.save_dir)
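The early-stopping rule used by update_train_state can be isolated from the rest of the training code. The sketch below is a simplified standalone version: the function name `epoch_of_early_stop`, the `patience` parameter (playing the role of args.early_stopping_criteria), and the loss values are illustrative assumptions. It stops once the validation loss has failed to beat its best value for `patience` consecutive epochs.

```python
def epoch_of_early_stop(val_losses, patience):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("inf")
    steps_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss                      # new best: remember it and reset the counter
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1   # no improvement this epoch
        if steps_without_improvement >= patience:
            return epoch
    return None

# Hypothetical validation losses, one per epoch.
val_losses = [1.0, 0.8, 0.9, 0.85, 0.7, 0.75, 0.72, 0.71]
print(epoch_of_early_stop(val_losses, patience=2))  # 3
print(epoch_of_early_stop(val_losses, patience=3))  # 7
print(epoch_of_early_stop(val_losses, patience=5))  # None
```

A larger patience tolerates temporary plateaus at the cost of extra epochs.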
3.4.2 Initiation
if args.reload_from_files:
    # Training from a checkpoint
    print("Reloading!")
    dataset = SurnameDataset.load_dataset_and_load_vectorizer(args.surname_csv,
                                                              args.vectorizer_file)
else:
    # Create the dataset and a new vectorizer
    print("Creating fresh!")
    dataset = SurnameDataset.load_dataset_and_make_vectorizer(args.surname_csv)
    dataset.save_vectorizer(args.vectorizer_file)
# Instantiate the classifier
vectorizer = dataset.get_vectorizer()
classifier = SurnameClassifier(input_dim=len(vectorizer.surname_vocab),        # input dimension
                               hidden_dim=args.hidden_dim,                     # hidden dimension
                               output_dim=len(vectorizer.nationality_vocab))   # output dimension
3.4.3 Training loop and evaluate
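Unlike the toy example, each batch here is a dictionary: the input is read from batch_dict with the key 'x_surname' and the label with the key 'y_nationality', in both the training and the validation loop.
The routine follows the usual procedure for feed-forward networks: it repeats the forward pass, loss computation, and backward pass, then updates the parameters from the gradients with the chosen optimizer, until the maximum number of epochs is reached or the early-stopping condition is met.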
import torch.optim as optim
# tqdm.notebook provides the notebook progress bar; the alias keeps the
# tqdm_notebook name used below
from tqdm.notebook import tqdm as tqdm_notebook

# Model training
classifier = classifier.to(args.device)
dataset.class_weights = dataset.class_weights.to(args.device)
# Cross-entropy loss weighted by class
loss_func = nn.CrossEntropyLoss(dataset.class_weights)
# Adam optimizer
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
# Learning-rate scheduler: halve the rate when the validation loss plateaus
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                                 mode='min', factor=0.5,
                                                 patience=1)
# Training state
train_state = make_train_state(args)
# Progress bars
epoch_bar = tqdm_notebook(desc='training routine',
                          total=args.num_epochs,
                          position=0)
dataset.set_split('train')
train_bar = tqdm_notebook(desc='split=train',
                          total=dataset.get_num_batches(args.batch_size),
                          position=1,
                          leave=True)
dataset.set_split('val')
val_bar = tqdm_notebook(desc='split=val',
                        total=dataset.get_num_batches(args.batch_size),
                        position=1,
                        leave=True)
try:
    for epoch_index in range(args.num_epochs):
        train_state['epoch_index'] = epoch_index
        # Iterate over training dataset
        # setup: batch generator, set loss and acc to 0, set train mode on
        dataset.set_split('train')
        batch_generator = generate_batches(dataset,
                                           batch_size=args.batch_size,
                                           device=args.device)
        running_loss = 0.0
        running_acc = 0.0
        classifier.train()
        for batch_index, batch_dict in enumerate(batch_generator):
            # Zero the gradients
            optimizer.zero_grad()
            # Forward pass
            y_pred = classifier(batch_dict['x_surname'])
            # Compute the loss and update its running mean
            loss = loss_func(y_pred, batch_dict['y_nationality'])
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)
            # Backward pass
            loss.backward()
            # Update the parameters
            optimizer.step()
            # Compute the accuracy and update its running mean
            acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)
            # Update the progress bar
            train_bar.set_postfix(loss=running_loss, acc=running_acc,
                                  epoch=epoch_index)
            train_bar.update()
        # Record the training metrics
        train_state['train_loss'].append(running_loss)
        train_state['train_acc'].append(running_acc)
        # Iterate over val dataset
        dataset.set_split('val')
        batch_generator = generate_batches(dataset,
                                           batch_size=args.batch_size,
                                           device=args.device)
        running_loss = 0.
        running_acc = 0.
        # Switch the model to evaluation mode
        classifier.eval()
        for batch_index, batch_dict in enumerate(batch_generator):
            # Compute the output
            y_pred = classifier(batch_dict['x_surname'])
            # Compute the loss
            loss = loss_func(y_pred, batch_dict['y_nationality'])
            loss_t = loss.to("cpu").item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)
            # Compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)
            val_bar.set_postfix(loss=running_loss, acc=running_acc,
                                epoch=epoch_index)
            val_bar.update()
        # Record the validation metrics
        train_state['val_loss'].append(running_loss)
        train_state['val_acc'].append(running_acc)
        train_state = update_train_state(args=args, model=classifier, train_state=train_state)
        scheduler.step(train_state['val_loss'][-1])
        # Stop if the early-stopping condition is met
        if train_state['stop_early']:
            break
        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()
except KeyboardInterrupt:
    print("Exiting loop")
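The update `running_loss += (loss_t - running_loss) / (batch_index + 1)` in the loop keeps an incremental mean, so the per-batch values never need to be stored. A quick standalone check, with made-up batch losses, that it matches the ordinary arithmetic mean:

```python
def running_mean(values):
    """Incremental mean: same update rule as running_loss in the training loop."""
    running = 0.0
    for index, value in enumerate(values):
        running += (value - running) / (index + 1)
    return running

batch_losses = [2.9, 2.4, 2.1, 1.8]   # hypothetical per-batch losses
incremental = running_mean(batch_losses)
direct = sum(batch_losses) / len(batch_losses)
print(abs(incremental - direct) < 1e-12)  # True
```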
3.5 Prediction
3.5.1 Classifying a new surname
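Given a surname as a string, the function first vectorizes it and then obtains the model's prediction. Note that apply_softmax=True is passed so that the output contains probabilities. In the multiclass (multinomial) case the prediction is a vector of class probabilities; the PyTorch tensor max function returns the highest probability together with the index of the most likely class.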
# Predict the nationality of a surname
def predict_nationality(surname, classifier, vectorizer):
    """Predict the nationality from a new surname
    Args:
        surname (str): the surname to classify
        classifier (SurnameClassifier): an instance of the classifier
        vectorizer (SurnameVectorizer): the corresponding vectorizer
    Returns:
        a dictionary with the most likely nationality and its probability
    """
    # Vectorize the surname
    vectorized_surname = vectorizer.vectorize(surname)
    # Convert the vector to a tensor of shape (1, input_dim)
    vectorized_surname = torch.tensor(vectorized_surname).view(1, -1)
    # Run the classifier with softmax to get a probability distribution
    result = classifier(vectorized_surname, apply_softmax=True)
    # Highest probability and its index
    probability_values, indices = result.max(dim=1)
    index = indices.item()
    # Look up the nationality for that index
    predicted_nationality = vectorizer.nationality_vocab.lookup_index(index)
    probability_value = probability_values.item()
    return {'nationality': predicted_nationality, 'probability': probability_value}

# Run a prediction
new_surname = input("Enter a surname to classify: ")
classifier = classifier.to("cpu")
prediction = predict_nationality(new_surname, classifier, vectorizer)
print("{} -> {} (p={:0.2f})".format(new_surname,
                                    prediction['nationality'],
                                    prediction['probability']))
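Prediction result:
3.5.2 Retrieving the top-k predictions for a new surname
It is often useful to look beyond the single best prediction. For example, a standard practice in NLP is to take the k best predictions and re-rank them with another model. PyTorch provides the torch.topk function, which is a convenient way to obtain them.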
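To make the idea concrete without PyTorch, the sketch below applies a softmax to a small vector of scores and keeps the k most probable classes. The label names, their order, and the scores are made up for illustration; `softmax` and `top_k` here are plain-Python stand-ins for what `F.softmax` and `torch.topk` compute, not their implementations.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return the k (index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

labels = ["English", "Russian", "Arabic", "Irish"]   # hypothetical class order
logits = [2.0, 0.5, 1.0, -1.0]                       # hypothetical model outputs
probs = softmax(logits)
best = top_k(probs, k=2)
for index, prob in best:
    print("{} (p={:0.2f})".format(labels[index], prob))
```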
# Predict the k most likely nationalities for a name
def predict_topk_nationality(name, classifier, vectorizer, k=5):
    # Vectorize the name
    vectorized_name = vectorizer.vectorize(name)
    # Convert the vector to a tensor of shape (1, input_dim)
    vectorized_name = torch.tensor(vectorized_name).view(1, -1)
    # Run the classifier with softmax to get a probability distribution
    prediction_vector = classifier(vectorized_name, apply_softmax=True)
    # The k highest probabilities and their indices
    probability_values, indices = torch.topk(prediction_vector, k=k)
    # Convert both tensors to numpy arrays
    probability_values = probability_values.detach().numpy()[0]
    indices = indices.detach().numpy()[0]
    # Build one result dictionary per prediction
    results = []
    for prob_value, index in zip(probability_values, indices):
        nationality = vectorizer.nationality_vocab.lookup_index(index)
        results.append({'nationality': nationality,
                        'probability': prob_value})
    return results

# Read the surname to classify
new_surname = input("Enter a surname to classify: ")
# Move the classifier to the CPU
classifier = classifier.to("cpu")
# Read how many predictions to show
k = int(input("How many of the top predictions to see? "))
# Cap k at the number of nationalities
if k > len(vectorizer.nationality_vocab):
    print("Sorry! That's more than the # of nationalities we have.. defaulting you to max size :)")
    k = len(vectorizer.nationality_vocab)
# Predict the nationalities of the new surname
predictions = predict_topk_nationality(new_surname, classifier, vectorizer, k=k)
# Print the predictions
print("Top {} predictions:".format(k))
print("===================")
for prediction in predictions:
    print("{} -> {} (p={:0.2f})".format(new_surname,
                                        prediction['nationality'],
                                        prediction['probability']))
Prediction result:
That’s all.