Surname Classification Models Built with a Multilayer Perceptron and with a CNN

I. Introduction to the Multilayer Perceptron

The multilayer perceptron (MLP) is a classic feed-forward neural network and one of the most basic models in deep learning. It consists of an input layer, one or more hidden layers, and an output layer, with every neuron in a layer fully connected to the neurons of the adjacent layers. Because its hidden layers apply non-linear activations, an MLP can handle non-linear problems and is the foundation of many more complex models.

Basic structure of an MLP

  1. Input layer

    • The number of neurons in the input layer equals the number of input features. For a dataset with n features, the input layer has n neurons.
  2. Hidden layer(s)

    • Hidden layers sit between the input and output layers; there can be one or several, and the number of neurons in each is a design choice. Hidden layers apply non-linear activation functions, which let the network learn complex non-linear mappings.
  3. Output layer

    • The number of output neurons depends on the prediction task: regression typically uses a single output neuron, while classification uses one neuron per class.

How an MLP works

  1. Forward propagation

    • Input data enters through the input layer and is passed layer by layer to the output layer. At each layer, the input is multiplied by a weight matrix, a bias is added, and the result is passed through an activation function to produce that layer's output.
  2. Activation function

    • Common activation functions include ReLU (Rectified Linear Unit), Sigmoid, and Tanh. The activation function is what allows the network to handle non-linear problems.
  3. Loss function

    • The loss function measures the gap between the network's predictions and the true labels. Mean squared error (MSE) is commonly used for regression and cross-entropy loss for classification.
  4. Backpropagation

    • Backpropagation updates the network's weights and biases to minimize the loss. It computes the gradient of the loss with respect to every parameter and applies gradient descent to update them; see the short sketch after this list.
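To make these four steps concrete, here is a minimal PyTorch sketch that runs one forward pass through a tiny two-layer MLP, computes a cross-entropy loss, and backpropagates. All sizes (16 input features, 32 hidden units, 4 classes, batch of 8) are made-up values chosen only for illustration.

import torch
import torch.nn as nn

# Toy sizes, for illustration only
mlp = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 16)             # a batch of 8 examples with 16 features each
y = torch.randint(0, 4, (8,))      # 8 integer class labels

logits = mlp(x)                    # forward propagation
loss = loss_fn(logits, y)          # loss between predictions and true labels
loss.backward()                    # backpropagation: gradients for every parameter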

Strengths and weaknesses of MLPs

Strengths

  • Can model non-linear problems.
  • Simple structure that is easy to implement and understand.
  • Applicable to many tasks, including regression and classification.

Weaknesses

  • For high-dimensional data and complex tasks, a shallow MLP may perform poorly and require more hidden layers and neurons.
  • Training can overfit, so regularization methods such as Dropout are often needed (a small sketch follows this list).
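As a hint of how Dropout is used in practice, here is a hypothetical variant of the toy MLP above with a Dropout layer after the hidden activation; the probability 0.5 is just an illustrative default.

import torch.nn as nn

# Same toy sizes as before, now with Dropout between the hidden layer and the output
mlp_with_dropout = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of the hidden activations during training
    nn.Linear(32, 4),
)
# Call .train() during training and .eval() at inference so Dropout is switched off when predicting.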

II. The Surnames Dataset

import collections
import numpy as np
import pandas as pd
import re

from argparse import Namespace

args = Namespace(
    raw_dataset_csv="data/surnames/surnames.csv",
    train_proportion=0.7,
    val_proportion=0.15,
    test_proportion=0.15,
    output_munged_csv="data/surnames/surnames_with_splits.csv",
    seed=1337
)


# Read raw data
surnames = pd.read_csv(args.raw_dataset_csv, header=0)
surnames.head()
# View the first few rows of the dataset

The first few rows of the dataset:

# Get the unique nationality values
set(surnames.nationality)

The unique nationality values:

# Splitting train by nationality
# Create dict
by_nationality = collections.defaultdict(list)
for _, row in surnames.iterrows():
    by_nationality[row.nationality].append(row.to_dict())

# Create split data
final_list = []
np.random.seed(args.seed)
for _, item_list in sorted(by_nationality.items()):
    np.random.shuffle(item_list)
    n = len(item_list)
    n_train = int(args.train_proportion*n)
    n_val = int(args.val_proportion*n)
    n_test = int(args.test_proportion*n)
    
    # Give data point a split attribute
    for item in item_list[:n_train]:
        item['split'] = 'train'
    for item in item_list[n_train:n_train+n_val]:
        item['split'] = 'val'
    for item in item_list[n_train+n_val:]:
        item['split'] = 'test'  
    
    # Add to final list
    final_list.extend(item_list)



# Write split data to file
final_surnames = pd.DataFrame(final_list)


final_surnames.split.value_counts()


Dataset split counts:

final_surnames.head()

The first few rows of the final dataset:

III. Surname Classification with a Multilayer Perceptron (MLP)

1. Import third-party libraries

# Import the required third-party libraries
from argparse import Namespace
from collections import Counter
import json
import os
import string
 
import numpy as np
import pandas as pd
 
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm_notebook

2. The dataset class: loading and processing the text

# Create a custom dataset class to load and process the text data
# Split into training, validation and test sets, record their sizes, and build a lookup dictionary
class SurnameDataset(Dataset):
    def __init__(self, surname_df, vectorizer):
        """
        Args:
            surname_df (pandas.DataFrame): the dataset
            vectorizer (SurnameVectorizer): vectorizer instantiated from dataset
        """
        self.surname_df = surname_df
        self._vectorizer = vectorizer
 
        self.train_df = self.surname_df[self.surname_df.split=='train']
        self.train_size = len(self.train_df)
 
        self.val_df = self.surname_df[self.surname_df.split=='val']
        self.validation_size = len(self.val_df)
 
        self.test_df = self.surname_df[self.surname_df.split=='test']
        self.test_size = len(self.test_df)
 
        self._lookup_dict = {'train': (self.train_df, self.train_size),
                             'val': (self.val_df, self.validation_size),
                             'test': (self.test_df, self.test_size)}
 
        self.set_split('train')
        
        # Class weights
        class_counts = surname_df.nationality.value_counts().to_dict()
        def sort_key(item):
            return self._vectorizer.nationality_vocab.lookup_token(item[0])
        sorted_counts = sorted(class_counts.items(), key=sort_key)
        frequencies = [count for _, count in sorted_counts]
        self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)
        
# Load the dataset and make a new vectorizer from scratch
    @classmethod
    def load_dataset_and_make_vectorizer(cls, surname_csv):
        """Load dataset and make a new vectorizer from scratch
        
        Args:
            surname_csv (str): location of the dataset
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        train_surname_df = surname_df[surname_df.split=='train']
        return cls(surname_df, SurnameVectorizer.from_dataframe(train_surname_df))
 
# Load the dataset and the corresponding cached vectorizer, for re-use
    @classmethod
    def load_dataset_and_load_vectorizer(cls, surname_csv, vectorizer_filepath):
        """Load dataset and the corresponding vectorizer. 
        Used when the vectorizer has been cached for re-use
        
        Args:
            surname_csv (str): location of the dataset
            vectorizer_filepath (str): location of the saved vectorizer
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(surname_df, vectorizer)
    
# Load the vectorizer from file
    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """a static method for loading the vectorizer from file
        
        Args:
            vectorizer_filepath (str): the location of the serialized vectorizer
        Returns:
            an instance of SurnameVectorizer
        """
        with open(vectorizer_filepath) as fp:
            return SurnameVectorizer.from_serializable(json.load(fp))
 
# Save the vectorizer to disk
    def save_vectorizer(self, vectorizer_filepath):
        """saves the vectorizer to disk using json
        
        Args:
            vectorizer_filepath (str): the location to save the vectorizer
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)
 
# Return the vectorizer object
    def get_vectorizer(self):
        """ returns the vectorizer """
        return self._vectorizer
 
# Select the dataset split
    def set_split(self, split="train"):
        """ selects the splits in the dataset using a column in the dataframe """
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]
 
# Return the size of the selected split
    def __len__(self):
        return self._target_size
 
# Get the data point at the given index, vectorize the surname, and encode the label
    def __getitem__(self, index):
        """the primary entry point method for PyTorch datasets
        
        Args:
            index (int): the index to the data point 
        Returns:
            a dictionary holding the data point's:
                features (x_surname)
                label (y_nationality)
        """
        row = self._target_df.iloc[index]
 
        surname_vector = \
            self._vectorizer.vectorize(row.surname)
 
        nationality_index = \
            self._vectorizer.nationality_vocab.lookup_token(row.nationality)
 
        return {'x_surname': surname_vector,
                'y_nationality': nationality_index}
 
# Return the number of batches in the dataset for a given batch size
    def get_num_batches(self, batch_size):
        """Given a batch size, return the number of batches in the dataset
        
        Args:
            batch_size (int)
        Returns:
            number of batches in the dataset
        """
        return len(self) // batch_size
 
# Generate batches of data
def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"): 
    """
    A generator function which wraps the PyTorch DataLoader. It will 
      ensure each tensor is on the right device.
    """
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)
 
    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name].to(device)
        yield out_data_dict
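For orientation, here is a brief hypothetical usage sketch of generate_batches; it assumes a dataset instance has already been built as in the helper-function section below and simply inspects one minibatch.

# Hypothetical usage: inspect one minibatch (assumes `dataset` has been built as shown later)
batch_generator = generate_batches(dataset, batch_size=64, device="cpu")
batch_dict = next(batch_generator)
print(batch_dict['x_surname'].shape)      # e.g. torch.Size([64, vocab_size]) for the MLP setup
print(batch_dict['y_nationality'].shape)  # torch.Size([64])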

3. Converting surname strings into vectorized minibatches

class Vocabulary(object):
    """Class to process text and extract vocabulary for mapping"""
# Initialize the vocabulary
    def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
        """
        Args:
            token_to_idx (dict): a pre-existing map of tokens to indices
            add_unk (bool): a flag that indicates whether to add the UNK token
            unk_token (str): the UNK token to add into the Vocabulary
        """
 
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx
 
        self._idx_to_token = {idx: token 
                              for token, idx in self._token_to_idx.items()}
        
        self._add_unk = add_unk
        self._unk_token = unk_token
        
        self.unk_index = -1
        if add_unk:
            self.unk_index = self.add_token(unk_token) 
    # Save the vocabulary in a serializable format
        
    def to_serializable(self):
        """ returns a dictionary that can be serialized """
        return {'token_to_idx': self._token_to_idx, 
                'add_unk': self._add_unk, 
                'unk_token': self._unk_token}
    # Instantiate the Vocabulary from a serialized dictionary
    @classmethod
    def from_serializable(cls, contents):
        """ instantiates the Vocabulary from a serialized dictionary """
        return cls(**contents)
    # Update the mapping dicts based on the given token
    def add_token(self, token):
        """Update mapping dicts based on the token.
        Args:
            token (str): the item to add into the Vocabulary
        Returns:
            index (int): the integer corresponding to the token
        """
        if token in self._token_to_idx:
            index = self._token_to_idx[token]
        else:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index
    # Add a list of tokens to the Vocabulary
    def add_many(self, tokens):
        """Add a list of tokens into the Vocabulary
        
        Args:
            tokens (list): a list of string tokens
        Returns:
            indices (list): a list of indices corresponding to the tokens
        """
        return [self.add_token(token) for token in tokens]
    # Retrieve the index associated with a token
    def lookup_token(self, token):
        """Retrieve the index associated with the token 
          or the UNK index if token isn't present.
        
        Args:
            token (str): the token to look up 
        Returns:
            index (int): the index corresponding to the token
        Notes:
            `unk_index` needs to be >=0 (having been added into the Vocabulary) 
              for the UNK functionality 
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]
    # Return the token associated with a given index
    def lookup_index(self, index):
        """Return the token associated with the index
        
        Args: 
            index (int): the index to look up
        Returns:
            token (str): the token corresponding to the index
        Raises:
            KeyError: if the index is not in the Vocabulary
        """
        if index not in self._idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self._idx_to_token[index]
# String representation describing the vocabulary size
    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)
# Number of unique tokens in the vocabulary
    def __len__(self):
        return len(self._token_to_idx)
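The dataset class above calls SurnameVectorizer.from_dataframe, vectorizer.vectorize, and the to_serializable/from_serializable pair, but this section never shows the SurnameVectorizer itself. Below is a minimal sketch consistent with those calls (and with input_dim=len(vectorizer.surname_vocab) used when the classifier is built later): it collapses a surname into a one-hot "bag of characters" vector. Treat it as an illustrative reconstruction, not the exact class used to produce the original outputs.

class SurnameVectorizer(object):
    """ Sketch of the vectorizer that coordinates the Vocabularies (reconstructed, not original) """
    def __init__(self, surname_vocab, nationality_vocab):
        self.surname_vocab = surname_vocab
        self.nationality_vocab = nationality_vocab

    def vectorize(self, surname):
        """Collapse a surname into a one-hot vector of size len(surname_vocab)."""
        one_hot = np.zeros(len(self.surname_vocab), dtype=np.float32)
        for token in surname:
            one_hot[self.surname_vocab.lookup_token(token)] = 1
        return one_hot

    @classmethod
    def from_dataframe(cls, surname_df):
        """Build the character and nationality vocabularies from the training dataframe."""
        surname_vocab = Vocabulary(unk_token="@")
        nationality_vocab = Vocabulary(add_unk=False)
        for _, row in surname_df.iterrows():
            for letter in row.surname:
                surname_vocab.add_token(letter)
            nationality_vocab.add_token(row.nationality)
        return cls(surname_vocab, nationality_vocab)

    @classmethod
    def from_serializable(cls, contents):
        return cls(surname_vocab=Vocabulary.from_serializable(contents['surname_vocab']),
                   nationality_vocab=Vocabulary.from_serializable(contents['nationality_vocab']))

    def to_serializable(self):
        return {'surname_vocab': self.surname_vocab.to_serializable(),
                'nationality_vocab': self.nationality_vocab.to_serializable()}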

4. Classifying surnames with a multilayer perceptron

class SurnameClassifier(nn.Module):
    """ A two-layer multilayer perceptron for classifying surnames """

    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Args:
            input_dim (int): the size of the input vectors
            hidden_dim (int): the output size of the first Linear layer
            output_dim (int): the output size of the second Linear layer
        """
        super(SurnameClassifier, self).__init__()
        # Define the two linear layers
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the classifier
        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation.
                It should be False if used with the cross-entropy loss.
        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim)
        """
        # First linear layer followed by a ReLU activation
        intermediate_vector = F.relu(self.fc1(x_in))
        # Second linear layer
        prediction_vector = self.fc2(intermediate_vector)

        if apply_softmax:
            # Apply softmax if requested
            prediction_vector = F.softmax(prediction_vector, dim=1)

        return prediction_vector
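A quick, hypothetical sanity check of the shapes; the dimensions (an 80-dimensional one-hot input, 300 hidden units, 18 classes, batch of 4) are arbitrary and only for illustration.

# Hypothetical shape check
clf = SurnameClassifier(input_dim=80, hidden_dim=300, output_dim=18)
dummy = torch.randn(4, 80)                        # a batch of 4 fake surname vectors
print(clf(dummy).shape)                           # torch.Size([4, 18]): raw logits
print(clf(dummy, apply_softmax=True).sum(dim=1))  # each row sums to 1 after softmax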

5. Helper functions

# Create a dictionary that represents the training state and initializes its parameters and metrics
def make_train_state(args):
    return {'stop_early': False,
            'early_stopping_step': 0,
            'early_stopping_best_val': 1e8,
            'learning_rate': args.learning_rate,
            'epoch_index': 0,
            'train_loss': [],
            'train_acc': [],
            'val_loss': [],
            'val_acc': [],
            'test_loss': -1,
            'test_acc': -1,
            'model_filename': args.model_state_file}
 
# Handle updates to the training state
def update_train_state(args, model, train_state):
    """Handle the training state updates.
    Components:
     - Early Stopping: Prevent overfitting.
     - Model Checkpoint: Model is saved if the model is better
    :param args: main arguments
    :param model: model to train
    :param train_state: a dictionary representing the training state values
    :returns:
        a new train_state
    """
 
    # Save at least one model
    if train_state['epoch_index'] == 0:
        torch.save(model.state_dict(), train_state['model_filename'])
        train_state['stop_early'] = False

    # If performance improved, save the model
    elif train_state['epoch_index'] >= 1:
        loss_tm1, loss_t = train_state['val_loss'][-2:]

        # If the loss worsened
        if loss_t >= train_state['early_stopping_best_val']:
            # Update step
            train_state['early_stopping_step'] += 1
        # If the loss decreased
        else:
            # Save the best model
            if loss_t < train_state['early_stopping_best_val']:
                torch.save(model.state_dict(), train_state['model_filename'])
                train_state['early_stopping_best_val'] = loss_t  # record the new best validation loss

            # Reset the early-stopping step counter
            train_state['early_stopping_step'] = 0

        # Stop early?
        train_state['stop_early'] = \
            train_state['early_stopping_step'] >= args.early_stopping_criteria
 
    return train_state
 
# Compute the accuracy of the model's predictions
def compute_accuracy(y_pred, y_target):
    _, y_pred_indices = y_pred.max(dim=1)
    n_correct = torch.eq(y_pred_indices, y_target).sum().item()
    return n_correct / len(y_pred_indices) * 100
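A tiny illustrative check of compute_accuracy with hand-built tensors (the values are made up): two of the three argmax predictions match the targets, so the result is about 66.7.

# Hypothetical check: 3 predictions, 2 of which match the targets
toy_pred = torch.tensor([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
toy_target = torch.tensor([0, 1, 1])
print(compute_accuracy(toy_pred, toy_target))  # 66.66...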
 
 
# Set random seeds
def set_seed_everywhere(seed, cuda):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)
# Make sure the output directory exists
def handle_dirs(dirpath):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)
 
 
args = Namespace(
    # Data and path information
    surname_csv="surnames_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch4/surname_mlp",
    # Model hyperparameters
    hidden_dim=300,
    # Training hyperparameters
    seed=1337,
    num_epochs=5,
    early_stopping_criteria=5,
    learning_rate=0.001,
    batch_size=64,
    # Runtime options
    cuda=False,
    reload_from_files=False,
    expand_filepaths_to_save_dir=True,
)
 
if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir,
                                        args.vectorizer_file)
 
    args.model_state_file = os.path.join(args.save_dir,
                                         args.model_state_file)
    
    print("Expanded filepaths: ")
    print("\t{}".format(args.vectorizer_file))
    print("\t{}".format(args.model_state_file))
    
# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False
 
args.device = torch.device("cuda" if args.cuda else "cpu")
    
print("Using CUDA: {}".format(args.cuda))
 
 
# Set seeds for reproducibility
set_seed_everywhere(args.seed, args.cuda)
 
# Handle directories
handle_dirs(args.save_dir)
 
# Decide whether to reload an existing dataset and vectorizer or to create fresh ones.
if args.reload_from_files:
    # training from a checkpoint
    print("Reloading!")
    dataset = SurnameDataset.load_dataset_and_load_vectorizer(args.surname_csv,
                                                              args.vectorizer_file)
else:
    # create dataset and vectorizer
    print("Creating fresh!")
    dataset = SurnameDataset.load_dataset_and_make_vectorizer(args.surname_csv)
    dataset.save_vectorizer(args.vectorizer_file)
    
vectorizer = dataset.get_vectorizer()
classifier = SurnameClassifier(input_dim=len(vectorizer.surname_vocab), 
                               hidden_dim=args.hidden_dim, 
                               output_dim=len(vectorizer.nationality_vocab))

6. Model training

# Preparation for model training
classifier = classifier.to(args.device)
dataset.class_weights = dataset.class_weights.to(args.device)
 
    
loss_func = nn.CrossEntropyLoss(dataset.class_weights)
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                                 mode='min', factor=0.5,
                                                 patience=1)
 
train_state = make_train_state(args)
 
epoch_bar = tqdm_notebook(desc='training routine', 
                          total=args.num_epochs,
                          position=0)
 
dataset.set_split('train')
train_bar = tqdm_notebook(desc='split=train',
                          total=dataset.get_num_batches(args.batch_size), 
                          position=1, 
                          leave=True)
dataset.set_split('val')
val_bar = tqdm_notebook(desc='split=val',
                        total=dataset.get_num_batches(args.batch_size), 
                        position=1, 
                        leave=True)
 
try:
    for epoch_index in range(args.num_epochs):
        train_state['epoch_index'] = epoch_index
 
        # Iterate over the training dataset

        # Setup: batch generator, set loss and acc to 0, set train mode on
 
        dataset.set_split('train')
        batch_generator = generate_batches(dataset, 
                                           batch_size=args.batch_size, 
                                           device=args.device)
        running_loss = 0.0
        running_acc = 0.0
        classifier.train()
 
        for batch_index, batch_dict in enumerate(batch_generator):
            # The training routine has the following 5 steps:

            # --------------------------------------
            # Step 1. Zero the gradients
            optimizer.zero_grad()

            # Step 2. Compute the output
            y_pred = classifier(batch_dict['x_surname'])

            # Step 3. Compute the loss
            loss = loss_func(y_pred, batch_dict['y_nationality'])
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # Step 4. Use the loss to produce gradients
            loss.backward()

            # Step 5. Use the optimizer to take a gradient step
            optimizer.step()
            # -----------------------------------------
            # Compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)

            # Update the progress bar values
            train_bar.set_postfix(loss=running_loss, acc=running_acc, 
                            epoch=epoch_index)
            train_bar.update()
 
        train_state['train_loss'].append(running_loss)
        train_state['train_acc'].append(running_acc)
 
        # Iterate over the validation dataset

        # Setup: set loss and acc to 0; set eval mode on
        dataset.set_split('val')
        batch_generator = generate_batches(dataset, 
                                           batch_size=args.batch_size, 
                                           device=args.device)
        running_loss = 0.
        running_acc = 0.
        classifier.eval()
 
        for batch_index, batch_dict in enumerate(batch_generator):
 
            # Compute the output
            y_pred =  classifier(batch_dict['x_surname'])
 
            # Compute the loss
            loss = loss_func(y_pred, batch_dict['y_nationality'])
            loss_t = loss.to("cpu").item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)
 
            # Compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)
            val_bar.set_postfix(loss=running_loss, acc=running_acc, 
                            epoch=epoch_index)
            val_bar.update()
 
        train_state['val_loss'].append(running_loss)
        train_state['val_acc'].append(running_acc)
 
        train_state = update_train_state(args=args, model=classifier,
                                         train_state=train_state)
 
        scheduler.step(train_state['val_loss'][-1])
 
        if train_state['stop_early']:
            break
 
        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()
except KeyboardInterrupt:
    print("Exiting loop")

7. Evaluating the best model

# Compute the loss and accuracy on the test set using the best available model
 
classifier.load_state_dict(torch.load(train_state['model_filename']))
 
classifier = classifier.to(args.device)
dataset.class_weights = dataset.class_weights.to(args.device)
loss_func = nn.CrossEntropyLoss(dataset.class_weights)
 
dataset.set_split('test')
batch_generator = generate_batches(dataset, 
                                   batch_size=args.batch_size, 
                                   device=args.device)
running_loss = 0.
running_acc = 0.
classifier.eval()
 
for batch_index, batch_dict in enumerate(batch_generator):
    # compute the output
    y_pred =  classifier(batch_dict['x_surname'])
    
    # compute the loss
    loss = loss_func(y_pred, batch_dict['y_nationality'])
    loss_t = loss.item()
    running_loss += (loss_t - running_loss) / (batch_index + 1)
 
    # compute the accuracy
    acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
    running_acc += (acc_t - running_acc) / (batch_index + 1)
 
train_state['test_loss'] = running_loss
train_state['test_acc'] = running_acc
# Print the loss and accuracy
print("Test loss: {};".format(train_state['test_loss']))
print("Test Accuracy: {}".format(train_state['test_acc']))

Loss and accuracy on the test set:

8. Predicting results

# Use the classifier and vectorizer to predict the nationality of a given surname
def predict_nationality(name, classifier, vectorizer):
    vectorized_name = vectorizer.vectorize(name)
    vectorized_name = torch.tensor(vectorized_name).view(1, -1)
    result = classifier(vectorized_name, apply_softmax=True)
 
    probability_values, indices = result.max(dim=1)
    index = indices.item()
 
    predicted_nationality = vectorizer.nationality_vocab.lookup_index(index)
    probability_value = probability_values.item()
 
    return {'nationality': predicted_nationality,
            'probability': probability_value}
 
# Read a surname from the user, predict its nationality with the classifier and vectorizer, and print the prediction and its probability
new_surname = input("Enter a surname to classify: ")
classifier = classifier.to("cpu")
prediction = predict_nationality(new_surname, classifier, vectorizer)
print("{} -> {} (p={:0.2f})".format(new_surname,
                                    prediction['nationality'],
                                    prediction['probability']))
 
# Look up the nationality at index 8 in the vectorizer's nationality vocabulary
vectorizer.nationality_vocab.lookup_index(8)
 
# Use the classifier and vectorizer to predict a surname's nationality and return the top-k most probable nationalities
def predict_topk_nationality(name, classifier, vectorizer, k=5):
    vectorized_name = vectorizer.vectorize(name)
    vectorized_name = torch.tensor(vectorized_name).view(1, -1)
    prediction_vector = classifier(vectorized_name, apply_softmax=True)
    probability_values, indices = torch.topk(prediction_vector, k=k)
    
    probability_values = probability_values.detach().numpy()[0]
    indices = indices.detach().numpy()[0]
    
    results = []
    for prob_value, index in zip(probability_values, indices):
        nationality = vectorizer.nationality_vocab.lookup_index(index)
        results.append({'nationality': nationality, 
                        'probability': prob_value})
    
    return results
 
 
new_surname = input("Enter a surname to classify: ")
classifier = classifier.to("cpu")
 
k = int(input("How many of the top predictions to see? "))
if k > len(vectorizer.nationality_vocab):
    print("Sorry! That's more than the # of nationalities we have.. defaulting you to max size :)")
    k = len(vectorizer.nationality_vocab)
    
predictions = predict_topk_nationality(new_surname, classifier, vectorizer, k=k)
 
print("Top {} predictions:".format(k))
print("===================")
for prediction in predictions:
    print("{} -> {} (p={:0.2f})".format(new_surname,
                                        prediction['nationality'],
                                        prediction['probability']))

Sample output:

IV. Introduction to Convolutional Neural Networks (CNNs)

A convolutional neural network (CNN) is a deep learning model that is particularly well suited to image data. It uses convolutional layers, pooling layers, and fully connected layers to automatically extract and learn image features, enabling tasks such as image classification, object detection, and image generation. Its key components and characteristics are:

1. Convolutional layer

The convolutional layer is the core component of a CNN: it extracts local features through the convolution operation. A set of kernels (filters) slides over the input, and each kernel produces a feature map. The layer's parameters include the kernel size, the stride, and the padding (a shape example follows).
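For intuition, the output length along each spatial dimension is floor((W - K + 2P) / S) + 1 for input size W, kernel size K, padding P, and stride S. A quick check with made-up sizes:

import torch
import torch.nn as nn

# Made-up sizes: one 3-channel 32x32 input, 8 kernels of size 3x3, stride 1, no padding
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=0)
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)  # torch.Size([1, 8, 30, 30]); (32 - 3 + 2*0) / 1 + 1 = 30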

2. Activation function

The most common activation function is ReLU (Rectified Linear Unit), which sets negative inputs to zero; this introduces the non-linearity that increases the model's expressive power.

3. Pooling layer

Pooling layers shrink the feature maps, which reduces computation and helps prevent overfitting. Common choices are max pooling and average pooling.

4. Fully connected layer

At the end of the network, one or more fully connected layers map the extracted features to the final output classes. Their parameters (weights and biases) are trained with backpropagation.

5. Regularization techniques

To prevent overfitting, CNNs commonly use Dropout, data augmentation, and L2 regularization.

6. Optimization algorithms

Common optimizers include stochastic gradient descent (SGD), Adam, and RMSprop; they adjust the learning rate and other parameters to improve training efficiency and stability.

7. How a CNN processes an input

  1. Input image: the raw image is taken as input.
  2. Convolution and activation: convolutional layers plus activation functions extract local features.
  3. Pooling: the feature maps are downsampled while important information is retained.
  4. Repeated convolution and pooling: these operations are stacked to extract progressively higher-level features.
  5. Fully connected layers: the extracted features are mapped to the output classes.
  6. Output: the final classification result (or other task-specific output) is produced; a minimal end-to-end sketch follows this list.
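A minimal end-to-end sketch of that pipeline with arbitrary example sizes (a 3-channel 32x32 input and 10 classes); it is only meant to illustrate the flow, not any particular architecture.

import torch
import torch.nn as nn

# A tiny illustrative CNN: conv -> ReLU -> pool, twice, then a fully connected classifier
tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),  # map the flattened features to 10 classes
)
print(tiny_cnn(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])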

8. Strengths of CNNs

  • Local receptive fields: convolution attends to local regions, which makes it effective at extracting local features.
  • Parameter sharing: each kernel is shared across the whole image, which reduces the number of parameters and improves training efficiency.
  • Spatial invariance: pooling and the sliding of kernels give CNNs a degree of translation invariance.

V. Surname Classification with a CNN

1. Import third-party libraries

from argparse import Namespace
from collections import Counter
import json
import os
import string
 
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm_notebook

2. The dataset class: loading and processing the text

# Create a custom dataset class to load and process the text data
# Split into training, validation and test sets, record their sizes, and build a lookup dictionary
class SurnameDataset(Dataset):
    def __init__(self, surname_df, vectorizer):
        """
        Args:
            surname_df (pandas.DataFrame): the dataset
            vectorizer (SurnameVectorizer): vectorizer instantiated from dataset
        """
        self.surname_df = surname_df
        self._vectorizer = vectorizer
        self.train_df = self.surname_df[self.surname_df.split=='train']
        self.train_size = len(self.train_df)
 
        self.val_df = self.surname_df[self.surname_df.split=='val']
        self.validation_size = len(self.val_df)
 
        self.test_df = self.surname_df[self.surname_df.split=='test']
        self.test_size = len(self.test_df)
 
        self._lookup_dict = {'train': (self.train_df, self.train_size),
                             'val': (self.val_df, self.validation_size),
                             'test': (self.test_df, self.test_size)}
 
        self.set_split('train')
        
        # Class weights
        class_counts = surname_df.nationality.value_counts().to_dict()
        def sort_key(item):
            return self._vectorizer.nationality_vocab.lookup_token(item[0])
        sorted_counts = sorted(class_counts.items(), key=sort_key)
        frequencies = [count for _, count in sorted_counts]
        self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)
 
# Load the dataset and make a new vectorizer from scratch
    @classmethod
    def load_dataset_and_make_vectorizer(cls, surname_csv):
        """Load dataset and make a new vectorizer from scratch
        
        Args:
            surname_csv (str): location of the dataset
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        train_surname_df = surname_df[surname_df.split=='train']
        return cls(surname_df, SurnameVectorizer.from_dataframe(train_surname_df))
# Load the dataset and the corresponding cached vectorizer, for re-use
    @classmethod
    def load_dataset_and_load_vectorizer(cls, surname_csv, vectorizer_filepath):
        """Load dataset and the corresponding vectorizer. 
        Used when the vectorizer has been cached for re-use
        
        Args:
            surname_csv (str): location of the dataset
            vectorizer_filepath (str): location of the saved vectorizer
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(surname_df, vectorizer)
# Load the vectorizer from file
    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """a static method for loading the vectorizer from file
        
        Args:
            vectorizer_filepath (str): the location of the serialized vectorizer
        Returns:
            an instance of SurnameVectorizer
        """
        with open(vectorizer_filepath) as fp:
            return SurnameVectorizer.from_serializable(json.load(fp))
# Save the vectorizer to disk
    def save_vectorizer(self, vectorizer_filepath):
        """saves the vectorizer to disk using json
        
        Args:
            vectorizer_filepath (str): the location to save the vectorizer
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)
# Return the vectorizer object
    def get_vectorizer(self):
        """ returns the vectorizer """
        return self._vectorizer
# Select the dataset split
    def set_split(self, split="train"):
        """ selects the splits in the dataset using a column in the dataframe """
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]
# Return the size of the selected split
    def __len__(self):
        return self._target_size
# Get the data point at the given index, vectorize the surname into a matrix, and encode the label
    def __getitem__(self, index):
        """the primary entry point method for PyTorch datasets
        
        Args:
            index (int): the index to the data point 
        Returns:
            a dictionary holding the data point's features (x_data) and label (y_target)
        """
        row = self._target_df.iloc[index]
 
        surname_matrix = \
            self._vectorizer.vectorize(row.surname)
 
        nationality_index = \
            self._vectorizer.nationality_vocab.lookup_token(row.nationality)
 
        return {'x_surname': surname_matrix,
                'y_nationality': nationality_index}
# Return the number of batches in the dataset for a given batch size
    def get_num_batches(self, batch_size):
        """Given a batch size, return the number of batches in the dataset
        
        Args:
            batch_size (int)
        Returns:
            number of batches in the dataset
        """
        return len(self) // batch_size
 
    
def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"):
    """
    A generator function which wraps the PyTorch DataLoader. It will 
      ensure each tensor is on the right device.
    """
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)
 
    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name].to(device)
        yield out_data_dict

3. Converting surname strings into vectorized minibatches

class Vocabulary(object):
    """用于处理文本并提取词汇的类"""
 
    def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
        """
        Initialize the Vocabulary object
        Args:
            token_to_idx (dict): a pre-existing map of tokens to indices
            add_unk (bool): a flag that indicates whether to add the UNK token
            unk_token (str): the UNK token to add to the Vocabulary
        """
 
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx
 
        # Build the index-to-token mapping
        self._idx_to_token = {idx: token 
                              for token, idx in self._token_to_idx.items()}
        
        self._add_unk = add_unk
        self._unk_token = unk_token
        
        self.unk_index = -1
        # Add the UNK token if requested
        if add_unk:
            self.unk_index = self.add_token(unk_token) 
        
    def to_serializable(self):
        """返回一个可序列化的字典"""
        return {'token_to_idx': self._token_to_idx, 
                'add_unk': self._add_unk, 
                'unk_token': self._unk_token}
 
    @classmethod
    def from_serializable(cls, contents):
        """从一个可序列化的字典中实例化Vocabulary对象"""
        return cls(**contents)
 
    def add_token(self, token):
        """根据词汇更新映射字典
        
        Args:
            token (str): 要添加到词汇表中的词汇
        Returns:
            index (int): 词汇对应的整数索引
        """
        try:
            index = self._token_to_idx[token]
        except KeyError:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index
    
    def add_many(self, tokens):
        """将一个词汇列表添加到词汇表中
        
        Args:
            tokens (list): 一个字符串词汇列表
        Returns:
            indices (list): 与词汇对应的整数索引列表
        """
        return [self.add_token(token) for token in tokens]
 
    def lookup_token(self, token):
        """检索与词汇关联的索引或UNK索引(如果词汇不存在)。
        
        Args:
            token (str): 要查找的词汇
        Returns:
            index (int): 与词汇对应的整数索引
        Notes:
            UNK功能需要unk_index >=0(已添加到词汇表中)
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]
 
    def lookup_index(self, index):
        """返回与索引关联的词汇
        
        Args: 
            index (int): 要查找的索引
        Returns:
            token (str): 与索引对应的词汇
        Raises:
            KeyError: 如果索引不在词汇表中
        """
        if index not in self._idx_to_token:
            raise KeyError("索引(%d)不在词汇表中" % index)
        return self._idx_to_token[index]
 
    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)
 
    def __len__(self):
        return len(self._token_to_idx)

class SurnameVectorizer(object):
    """姓氏矢量化器,协调词汇表并将其应用于数据"""
    
    def __init__(self, surname_vocab, nationality_vocab, max_surname_length):
        """
        Args:
            surname_vocab (Vocabulary): maps characters to integers
            nationality_vocab (Vocabulary): maps nationalities to integers
            max_surname_length (int): the length of the longest surname
        """
        self.surname_vocab = surname_vocab
        self.nationality_vocab = nationality_vocab
        self._max_surname_length = max_surname_length
 
    def vectorize(self, surname):
        """
        Args:
            surname (str): the surname
        Returns:
            one_hot_matrix (np.ndarray): a matrix of one-hot vectors
        """
        # Create an all-zeros matrix
        one_hot_matrix_size = (len(self.surname_vocab), self._max_surname_length)
        one_hot_matrix = np.zeros(one_hot_matrix_size, dtype=np.float32)
        
        # Convert each character of the surname into a one-hot vector
        for position_index, character in enumerate(surname):
            character_index = self.surname_vocab.lookup_token(character)
            one_hot_matrix[character_index][position_index] = 1
        
        return one_hot_matrix
 
    @classmethod
    def from_dataframe(cls, surname_df):
        """从数据框实例化矢量化器
        
        Args:
            surname_df (pandas.DataFrame): 姓氏数据集
        Returns:
            SurnameVectorizer的一个实例
        """
        surname_vocab = Vocabulary(unk_token="@")
        nationality_vocab = Vocabulary(add_unk=False)
        max_surname_length = 0
 
        for index, row in surname_df.iterrows():
            max_surname_length = max(max_surname_length, len(row.surname))
            for letter in row.surname:
                surname_vocab.add_token(letter)
            nationality_vocab.add_token(row.nationality)
 
        return cls(surname_vocab, nationality_vocab, max_surname_length)
 
    @classmethod
    def from_serializable(cls, contents):
        surname_vocab = Vocabulary.from_serializable(contents['surname_vocab'])
        nationality_vocab =  Vocabulary.from_serializable(contents['nationality_vocab'])
        return cls(surname_vocab=surname_vocab, nationality_vocab=nationality_vocab, 
                   max_surname_length=contents['max_surname_length'])
 
    def to_serializable(self):
        return {'surname_vocab': self.surname_vocab.to_serializable(),
                'nationality_vocab': self.nationality_vocab.to_serializable(), 
                'max_surname_length': self._max_surname_length}
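A small self-contained check of what vectorize produces; the three-character vocabulary and padding length below are made up purely for illustration.

# Hypothetical check: a toy vocabulary of 3 characters and a maximum surname length of 5
toy_surname_vocab = Vocabulary(unk_token="@")
toy_surname_vocab.add_many(list("abc"))
toy_nationality_vocab = Vocabulary(add_unk=False)
toy_nationality_vocab.add_token("English")
toy_vectorizer = SurnameVectorizer(toy_surname_vocab, toy_nationality_vocab, max_surname_length=5)
print(toy_vectorizer.vectorize("cab").shape)  # (4, 5): (3 characters + UNK) x max_surname_length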

4. Classifying surnames with a convolutional neural network

class SurnameClassifier(nn.Module):
    def __init__(self, initial_num_channels, num_classes, num_channels):
        """
        Args:
            initial_num_channels (int): size of the incoming feature vectors
            num_classes (int): size of the output prediction vector
            num_channels (int): constant channel size used throughout the network
        """
        super(SurnameClassifier, self).__init__()
        
        # Define the convolutional network
        self.convnet = nn.Sequential(
            nn.Conv1d(in_channels=initial_num_channels, 
                      out_channels=num_channels, kernel_size=3),
            nn.ELU(),  # ELU activation
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels, 
                      kernel_size=3, stride=2),  # convolution with stride 2
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels, 
                      kernel_size=3, stride=2),  # convolution with stride 2
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels, 
                      kernel_size=3),  # convolution without stride
            nn.ELU()
        )
        
        # Fully connected layer mapping the convolutional features to the prediction vector
        self.fc = nn.Linear(num_channels, num_classes)

    def forward(self, x_surname, apply_softmax=False):
        """The forward pass of the classifier
        
        Args:
            x_surname (torch.Tensor): an input data tensor.
                x_surname.shape should be (batch, initial_num_channels, max_surname_length)
            apply_softmax (bool): a flag for the softmax activation.
                It should be False if used with the cross-entropy loss.
        Returns:
            the resulting tensor. tensor.shape should be (batch, num_classes)
        """
        # Extract features with the convolutional network
        features = self.convnet(x_surname).squeeze(dim=2)
       
        # Map the extracted features to the prediction vector through the fully connected layer
        prediction_vector = self.fc(features)

        # Apply softmax if requested
        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)

        return prediction_vector
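Each kernel_size=3 convolution shortens the sequence by 2 and each stride-2 convolution roughly halves it, so for surnames padded to the dataset's maximum length the final feature map collapses to length 1, which is what squeeze(dim=2) relies on. A hypothetical shape trace (vocabulary of 80 characters, maximum length 17, 256 channels, 18 classes; all values assumed for illustration):

# Hypothetical shape trace
cnn_clf = SurnameClassifier(initial_num_channels=80, num_classes=18, num_channels=256)
dummy = torch.randn(4, 80, 17)        # (batch, initial_num_channels, max_surname_length)
print(cnn_clf.convnet(dummy).shape)   # torch.Size([4, 256, 1]): 17 -> 15 -> 7 -> 3 -> 1
print(cnn_clf(dummy).shape)           # torch.Size([4, 18]) after the squeeze and the fc layer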

5. Helper functions

# Create a dictionary that represents the training state and initializes its parameters and metrics
def make_train_state(args):
    return {'stop_early': False,
            'early_stopping_step': 0,
            'early_stopping_best_val': 1e8,
            'learning_rate': args.learning_rate,
            'epoch_index': 0,
            'train_loss': [],
            'train_acc': [],
            'val_loss': [],
            'val_acc': [],
            'test_loss': -1,
            'test_acc': -1,
            'model_filename': args.model_state_file}
 
# Handle updates to the training state
def update_train_state(args, model, train_state):
    """Handle the training state updates.
    Components:
     - Early Stopping: Prevent overfitting.
     - Model Checkpoint: Model is saved if the model is better
    :param args: main arguments
    :param model: model to train
    :param train_state: a dictionary representing the training state values
    :returns:
        a new train_state
    """
 
    # Save at least one model
    if train_state['epoch_index'] == 0:
        torch.save(model.state_dict(), train_state['model_filename'])
        train_state['stop_early'] = False

    # If performance improved, save the model
    elif train_state['epoch_index'] >= 1:
        loss_tm1, loss_t = train_state['val_loss'][-2:]

        # If the loss worsened
        if loss_t >= train_state['early_stopping_best_val']:
            # Update step
            train_state['early_stopping_step'] += 1
        # If the loss decreased
        else:
            # Save the best model
            if loss_t < train_state['early_stopping_best_val']:
                torch.save(model.state_dict(), train_state['model_filename'])
                train_state['early_stopping_best_val'] = loss_t  # record the new best validation loss

            # Reset the early-stopping step counter
            train_state['early_stopping_step'] = 0

        # Stop early?
        train_state['stop_early'] = \
            train_state['early_stopping_step'] >= args.early_stopping_criteria
 
    return train_state
 
# Compute the accuracy of the model's predictions
def compute_accuracy(y_pred, y_target):
    _, y_pred_indices = y_pred.max(dim=1)
    n_correct = torch.eq(y_pred_indices, y_target).sum().item()
    return n_correct / len(y_pred_indices) * 100
 
 
# Set random seeds
def set_seed_everywhere(seed, cuda):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)
# Make sure the output directory exists
def handle_dirs(dirpath):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)
 
 
args = Namespace(
    # Data and path information
    surname_csv="surnames_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch4/surname_mlp",
    # 模型超参数
    hidden_dim=300,
    # Training hyperparameters
    seed=1337,
    num_epochs=5,
    early_stopping_criteria=5,
    learning_rate=0.001,
    batch_size=64,
    # Runtime options
    cuda=False,
    reload_from_files=False,
    expand_filepaths_to_save_dir=True,
)
 
if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir,
                                        args.vectorizer_file)
 
    args.model_state_file = os.path.join(args.save_dir,
                                         args.model_state_file)
    
    print("Expanded filepaths: ")
    print("\t{}".format(args.vectorizer_file))
    print("\t{}".format(args.model_state_file))
    
# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False
 
args.device = torch.device("cuda" if args.cuda else "cpu")
    
print("Using CUDA: {}".format(args.cuda))
 
 
# Set seeds for reproducibility
set_seed_everywhere(args.seed, args.cuda)
 
# Handle directories
handle_dirs(args.save_dir)
 
# Decide whether to reload an existing dataset and vectorizer or to create fresh ones.
if args.reload_from_files:
    # training from a checkpoint
    print("Reloading!")
    dataset = SurnameDataset.load_dataset_and_load_vectorizer(args.surname_csv,
                                                              args.vectorizer_file)
else:
    # create dataset and vectorizer
    print("Creating fresh!")
    dataset = SurnameDataset.load_dataset_and_make_vectorizer(args.surname_csv)
    dataset.save_vectorizer(args.vectorizer_file)
    
vectorizer = dataset.get_vectorizer()
classifier = SurnameClassifier(initial_num_channels=len(vectorizer.surname_vocab), 
                               num_classes=len(vectorizer.nationality_vocab), 
                               num_channels=args.num_channels)

6. Model training

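The loop below uses train_state, loss_func, optimizer and scheduler, which are not set up anywhere in this section. The following preparation step mirrors the one from the MLP section (same Adam optimizer and ReduceLROnPlateau scheduler) and should be run before the loop.

# Preparation for model training (mirrors the MLP section; run before the training loop)
classifier = classifier.to(args.device)
dataset.class_weights = dataset.class_weights.to(args.device)

loss_func = nn.CrossEntropyLoss(dataset.class_weights)
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                                 mode='min', factor=0.5,
                                                 patience=1)
train_state = make_train_state(args)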
epoch_bar = tqdm_notebook(desc='training routine',  # progress bar for the overall training routine
                          total=args.num_epochs,  # total number of epochs
                          position=0)  # position of the progress bar

# Set the dataset to the training split and create a progress bar for it
dataset.set_split('train')
train_bar = tqdm_notebook(desc='split=train',  # progress bar for the training split
                          total=dataset.get_num_batches(args.batch_size), 
                          position=1, 
                          leave=True)
# Set the dataset to the validation split and create a progress bar for it
dataset.set_split('val')
val_bar = tqdm_notebook(desc='split=val',  # progress bar for the validation split
                        total=dataset.get_num_batches(args.batch_size), 
                        position=1, 
                        leave=True)
 
try:
    for epoch_index in range(args.num_epochs):  # loop over the epochs
        train_state['epoch_index'] = epoch_index

        # Iterate over the training dataset

        # Setup: batch generator, set loss and acc to 0, set train mode on
        dataset.set_split('train')
        batch_generator = generate_batches(dataset, 
                                           batch_size=args.batch_size, 
                                           device=args.device)
        running_loss = 0.0
        running_acc = 0.0
        classifier.train()  # set the model to training mode
 
        for batch_index, batch_dict in enumerate(batch_generator):
            # The training routine has these 5 steps:

            # --------------------------------------
            # Step 1. Zero the gradients
            optimizer.zero_grad()

            # Step 2. Compute the output
            y_pred = classifier(batch_dict['x_surname'])

            # Step 3. Compute the loss
            loss = loss_func(y_pred, batch_dict['y_nationality'])
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # Step 4. Use the loss to produce gradients
            loss.backward()

            # Step 5. Use the optimizer to take a gradient step
            optimizer.step()
            # -----------------------------------------
            # Compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)

            # Update the progress bar
            train_bar.set_postfix(loss=running_loss, acc=running_acc, 
                            epoch=epoch_index)
            train_bar.update()
 
        train_state['train_loss'].append(running_loss)
        train_state['train_acc'].append(running_acc)
 
        # Iterate over the validation dataset

        # Setup: batch generator, set loss and acc to 0, set eval mode on
        dataset.set_split('val')
        batch_generator = generate_batches(dataset, 
                                           batch_size=args.batch_size, 
                                           device=args.device)
        running_loss = 0.
        running_acc = 0.
        classifier.eval()  # set the model to evaluation mode
 
        for batch_index, batch_dict in enumerate(batch_generator):

            # Compute the output
            y_pred = classifier(batch_dict['x_surname'])

            # Compute the loss
            loss = loss_func(y_pred, batch_dict['y_nationality'])
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # Compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)
            val_bar.set_postfix(loss=running_loss, acc=running_acc, 
                            epoch=epoch_index)
            val_bar.update()
 
        train_state['val_loss'].append(running_loss)
        train_state['val_acc'].append(running_acc)
 
        train_state = update_train_state(args=args, model=classifier,
                                         train_state=train_state)
 
        scheduler.step(train_state['val_loss'][-1])
 
        if train_state['stop_early']:
            break
 
        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()
except KeyboardInterrupt:
    print("Exiting loop")  

Training progress:

7. Evaluating the best model

# Load the best model weights
classifier.load_state_dict(torch.load(train_state['model_filename']))

# Move the model to the target device
classifier = classifier.to(args.device)
# Move the class weights to the same device
dataset.class_weights = dataset.class_weights.to(args.device)
# Cross-entropy loss weighted by the class weights
loss_func = nn.CrossEntropyLoss(dataset.class_weights)

# Switch the dataset to the test split and generate batches
dataset.set_split('test')
batch_generator = generate_batches(dataset, 
                                   batch_size=args.batch_size, 
                                   device=args.device)
running_loss = 0.
running_acc = 0.
classifier.eval()  # set the model to evaluation mode

# Iterate over the test dataset
for batch_index, batch_dict in enumerate(batch_generator):
    # Compute the model output
    y_pred = classifier(batch_dict['x_surname'])
    
    # Compute the loss
    loss = loss_func(y_pred, batch_dict['y_nationality'])
    loss_t = loss.item()
    running_loss += (loss_t - running_loss) / (batch_index + 1)

    # Compute the accuracy
    acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
    running_acc += (acc_t - running_acc) / (batch_index + 1)

# Record the test results in the training state
train_state['test_loss'] = running_loss
train_state['test_acc'] = running_acc
# Print the loss and accuracy
print("Test loss: {};".format(train_state['test_loss']))
print("Test Accuracy: {}".format(train_state['test_acc']))

Loss and accuracy on the test set:

8. Predicting results

def predict_nationality(surname, classifier, vectorizer):
    """Predict the nationality of a new surname
    
    Args:
        surname (str): the surname to classify
        classifier (SurnameClassifier): an instance of the classifier
        vectorizer (SurnameVectorizer): the corresponding vectorizer
    Returns:
        a dictionary with the most likely nationality and its probability
    """
    # Vectorize the surname
    vectorized_surname = vectorizer.vectorize(surname)
    # Convert the vectorized surname to a tensor and add a batch dimension
    vectorized_surname = torch.tensor(vectorized_surname).unsqueeze(0)
    # Run the classifier and apply softmax
    result = classifier(vectorized_surname, apply_softmax=True)

    # Get the highest probability and its index
    probability_values, indices = result.max(dim=1)
    index = indices.item()

    # Look up the nationality for that index
    predicted_nationality = vectorizer.nationality_vocab.lookup_index(index)
    probability_value = probability_values.item()

    return {'nationality': predicted_nationality, 'probability': probability_value}

new_surname = input("Enter a surname to classify: ")  # the surname to classify
classifier = classifier.cpu()  # move the classifier to the CPU for prediction
prediction = predict_nationality(new_surname, classifier, vectorizer)  # predict the nationality
print("{} -> {} (p={:0.2f})".format(new_surname,  # print the prediction
                                    prediction['nationality'],
                                    prediction['probability']))
 
def predict_topk_nationality(surname, classifier, vectorizer, k=5):
    """Predict the top-K nationalities for a new surname
    
    Args:
        surname (str): the surname to classify
        classifier (SurnameClassifier): an instance of the classifier
        vectorizer (SurnameVectorizer): the corresponding vectorizer
        k (int): the number of top nationalities to return
    Returns:
        a list of dictionaries, each holding a nationality and its probability
    """
    
    # Vectorize the surname and add a batch dimension
    vectorized_surname = vectorizer.vectorize(surname)
    vectorized_surname = torch.tensor(vectorized_surname).unsqueeze(dim=0)
    # Get the prediction vector
    prediction_vector = classifier(vectorized_surname, apply_softmax=True)
    # Get the K highest probabilities and their indices
    probability_values, indices = torch.topk(prediction_vector, k=k)
    
    # Convert the results to numpy arrays
    probability_values = probability_values[0].detach().numpy()
    indices = indices[0].detach().numpy()
    
    results = []
    # Collect the top-K predictions
    for kth_index in range(k):
        nationality = vectorizer.nationality_vocab.lookup_index(indices[kth_index])
        probability_value = probability_values[kth_index]
        results.append({'nationality': nationality, 
                        'probability': probability_value})
    return results
 
new_surname = input("Enter a surname to classify: ")  # 输入待分类的姓氏
 
k = int(input("How many of the top predictions to see? "))  # 选择要查看的前K个预测结果
if k > len(vectorizer.nationality_vocab):
    print("Sorry! That's more than the # of nationalities we have.. defaulting you to max size :)")
    k = len(vectorizer.nationality_vocab)
    
# Get the top-K predictions
predictions = predict_topk_nationality(new_surname, classifier, vectorizer, k=k)
 
print("Top {} predictions:".format(k))
print("===================")
for prediction in predictions:
    # Print each prediction
    print("{} -> {} (p={:0.2f})".format(new_surname,
                                        prediction['nationality'],
                                        prediction['probability']))

Sample output:
