NLP: Surname Classification with a Convolutional Neural Network

1. Overview of the Experiment

This experiment uses a CNN (convolutional neural network) model to classify surnames, predicting a person's nationality of origin from their surname.

2. Convolutional Neural Networks

2.1 Convolution and Related Concepts

Convolution is a mathematical operation that combines a signal (or function) with a kernel.
In deep learning, convolution is the core building block of convolutional neural networks (CNNs) and is widely used in image processing, natural language processing, and time-series analysis.

2.1.1 1D Convolution

  • 1D convolution is commonly used for one-dimensional data such as time series or audio signals. Its basic principle is to slide a kernel (filter) across the input and compute a dot product at every position.

  • Mathematically:
    $y[n] = (x * h)[n] = \sum_{k=0}^{K-1} x[n-k] \cdot h[k]$
    where $x$ is the input signal and $h$ is a kernel of length $K$.

  • The computation proceeds as follows: the kernel starts at the beginning of the input, the dot product between the kernel and the current window of the input is computed, then the kernel slides to the next position and the computation is repeated.
    As shown in Figure 1, green is the kernel, yellow is the input sequence, and blue is the result of the 1D convolution.
    [Figure 1: 1D convolution]

  • A code demonstration of 1D convolution:

import numpy as np

def convolve1d(signal, kernel):
    """Valid (no padding) 1D sliding dot product of a signal with a kernel."""
    kernel_size = len(kernel)
    signal_size = len(signal)
    output_size = signal_size - kernel_size + 1
    output = np.zeros(output_size)

    # Slide the kernel over the signal and take the dot product at each position.
    for i in range(output_size):
        output[i] = np.dot(signal[i:i+kernel_size], kernel)

    return output

signal = np.array([1, 2, 3, 4, 5])
kernel = np.array([1, 0, -1])
output = convolve1d(signal, kernel)
print("1D convolution result:", output)

2.1.2 2D Convolution

  • 2D convolution is mainly used in image processing: a kernel is slid over a two-dimensional matrix (an image) and the convolution is computed at each position.

  • Mathematically:
    $Y[i,j] = (X * K)[i,j] = \sum_{m=0}^{M-1}\sum_{n=0}^{N-1} X[i+m, j+n] \cdot K[m,n]$
    where $X$ is the input matrix and $K$ is the $M \times N$ kernel.

  • The computation proceeds as follows: the kernel starts at the top-left corner of the input matrix, the dot product between the kernel and the current window is computed, then the kernel slides to the next position and the computation is repeated.
    As shown in Figure 2, the black frame is the input matrix, the blue frame is the 2D kernel, and the red frame is the result of the 2D convolution.
    [Figure 2: 2D convolution]

  • A code demonstration of 2D convolution:

def convolve2d(image, kernel):
    """Valid (no padding) 2D sliding window sum of products of an image with a kernel."""
    kernel_height, kernel_width = kernel.shape
    image_height, image_width = image.shape
    output_height = image_height - kernel_height + 1
    output_width = image_width - kernel_width + 1
    output = np.zeros((output_height, output_width))

    # Slide the kernel over the image and sum the element-wise products at each position.
    for i in range(output_height):
        for j in range(output_width):
            output[i, j] = np.sum(image[i:i+kernel_height, j:j+kernel_width] * kernel)

    return output

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
kernel = np.array([[1, 0],
                   [0, -1]])
output = convolve2d(image, kernel)
print("2D convolution result:\n", output)

  • Depending on how the input is padded, 2D convolution comes in three variants: full convolution, same convolution, and valid convolution.
    As shown in Figure 3, in full-convolution mode the borders of the input matrix are fully padded so that the kernel covers every element of the input, including the border elements; the output is therefore larger than the input.
    [Figure 3: full convolution]

As shown in Figure 4, in same-convolution mode an appropriate number of zeros is padded around the input so that the output has the same size as the input (with a stride of 1).
[Figure 4: same convolution]

As shown in Figure 5, in valid-convolution mode no padding is applied; the kernel slides entirely inside the input matrix, which makes the output smaller than the input.
[Figure 5: valid convolution]

  • Stride: during the convolution, the kernel can be moved by a specified step, called the stride. Figure 6 shows a stride of 1 and Figure 7 a stride of 2; a short NumPy sketch of the padding modes and of striding follows below.
    [Figure 6: stride = 1]
    [Figure 7: stride = 2]
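
As a rough illustration (a sketch only; np.convolve flips the kernel, which does not affect the output sizes we care about here), the three padding modes and the effect of stride can be shown in one dimension:

import numpy as np

signal = np.array([1, 2, 3, 4, 5])
kernel = np.array([1, 0, -1])

# Output length: full = N + K - 1, same = N, valid = N - K + 1.
for mode in ("full", "same", "valid"):
    out = np.convolve(signal, kernel, mode=mode)
    print(mode, "-> length", len(out), ":", out)

# A stride-s convolution keeps every s-th position of the valid result.
stride = 2
valid = np.convolve(signal, kernel, mode="valid")
print("stride", stride, "->", valid[::stride])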

2.2 Convolutional Neural Networks in Brief

A convolutional neural network typically contains an input layer, convolutional layers, activation layers, pooling layers, fully connected layers, batch normalization layers, dropout layers, and an output layer, usually stacking several convolution, activation, and pooling layers. A minimal PyTorch sketch combining these layers is given after the list below.

  1. Input layer: the entry point of the network; it receives the raw data, such as images or text.

  2. Convolutional layer: the core of a CNN, used to extract local features from the input, such as edges and textures in an image. Its main parameters are the kernel size, stride, and padding.

  3. Activation layer: introduces non-linearity so the network can learn and express more complex functions. Common activation functions include ReLU, LeakyReLU, Tanh, and Sigmoid.

  4. Pooling layer: downsamples the feature maps, reducing computation and helping to prevent overfitting. Common choices are max pooling and average pooling. Figure 8 shows a simple max-pooling example.
    [Figure 8: max pooling]

  5. Fully connected layer: maps the extracted features into the label space for the final classification or regression task.

  6. Output layer: produces the final result; the activation and loss functions are chosen according to the task.

  7. Batch normalization layer: normalizes each mini-batch so that the inputs to each layer stay stable, which speeds up training and improves stability.

  8. Dropout layer: during each training iteration, randomly "drops" neurons with some probability (e.g. 0.5), preventing the model from relying too heavily on particular local features and thus effectively reducing overfitting. Figure 9 illustrates how dropout works.
    [Figure 9: dropout]
    Why does dropout help against overfitting?
    On the one hand, randomly deactivating neurons reduces the number of parameters involved in each gradient update, lowering the effective model capacity. On the other hand, it encourages the weights to spread out, acting as a form of regularization. Dropout can also be viewed as an implicit ensemble of sub-models.
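
To make the list concrete, here is a minimal, self-contained PyTorch sketch (with made-up sizes, unrelated to the surname model later in this post) that chains these layer types together for a 1D input:

import torch
import torch.nn as nn

toy_cnn = nn.Sequential(
    nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3, padding=1),  # convolution
    nn.BatchNorm1d(16),                                                   # batch normalization
    nn.ReLU(),                                                            # activation
    nn.MaxPool1d(kernel_size=2),                                          # max pooling (downsampling)
    nn.Dropout(p=0.5),                                                    # dropout
    nn.Flatten(),                                                         # flatten for the fully connected layer
    nn.Linear(16 * 5, 4),                                                 # fully connected layer -> 4 classes
)

x = torch.randn(2, 8, 10)   # (batch, channels, length)
print(toy_cnn(x).shape)     # torch.Size([2, 4])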

3. Experiment Steps

Having briefly reviewed convolutional neural networks, we now turn to the main content of this experiment: surname classification with a convolutional neural network.

3.1 The Surname Dataset

Although we use the same dataset as in the "surname classification with a multilayer perceptron" task, there is one difference in the implementation: the dataset now produces a matrix of one-hot vectors rather than a single collapsed one-hot vector. To support this, the dataset class tracks the longest surname and provides its length to the vectorizer as the number of columns in the matrix; the number of rows is the size of each one-hot vector (the size of the character vocabulary).

We use the longest surname in the dataset to set the size of the one-hot matrix for two reasons. First, combining a mini-batch of surname matrices into a single three-dimensional tensor requires them to all have the same size (see the toy sketch below). Second, using the longest surname in the dataset means every mini-batch can be processed in the same way.
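
A toy NumPy illustration of this batching constraint (the sizes here are made up and unrelated to the real vocabulary):

import numpy as np

vocab_size, max_surname_length = 5, 8

# Two "vectorized surnames", both padded out to max_surname_length columns.
a = np.zeros((vocab_size, max_surname_length), dtype=np.float32)
b = np.zeros((vocab_size, max_surname_length), dtype=np.float32)
a[2, 0] = a[0, 1] = 1.0              # pretend surname of length 2
b[1, 0] = b[3, 1] = b[4, 2] = 1.0    # pretend surname of length 3

# Stacking works only because both matrices share the same shape.
batch = np.stack([a, b])
print(batch.shape)   # (2, 5, 8) -> (batch, vocab_size, max_surname_length)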

class SurnameDataset(Dataset):
    def __init__(self, surname_df, vectorizer):
        """
        Args:
            surname_df (pandas.DataFrame): the dataset
            vectorizer (SurnameVectorizer): vectorizer instantiated from the dataset
        """
        self.surname_df = surname_df
        self._vectorizer = vectorizer
        
        # train / validation / test splits
        self.train_df = self.surname_df[self.surname_df.split=='train']
        self.train_size = len(self.train_df)

        self.val_df = self.surname_df[self.surname_df.split=='val']
        self.validation_size = len(self.val_df)

        self.test_df = self.surname_df[self.surname_df.split=='test']
        self.test_size = len(self.test_df)
        
        # lookup dict for quickly switching between splits
        self._lookup_dict = {'train': (self.train_df, self.train_size),
                             'val': (self.val_df, self.validation_size),
                             'test': (self.test_df, self.test_size)}

        self.set_split('train')
        
        # Class weights: compute per-class frequencies and derive inverse-frequency
        # weights to compensate for class imbalance
        class_counts = surname_df.nationality.value_counts().to_dict()
        def sort_key(item):
            return self._vectorizer.nationality_vocab.lookup_token(item[0])
        sorted_counts = sorted(class_counts.items(), key=sort_key)
        frequencies = [count for _, count in sorted_counts]
        self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)

    @classmethod
    # load the dataset and create a new vectorizer from scratch
    def load_dataset_and_make_vectorizer(cls, surname_csv):
        """Load dataset and make a new vectorizer from scratch
        
        Args:
            surname_csv (str): location of the dataset
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        train_surname_df = surname_df[surname_df.split=='train']
        return cls(surname_df, SurnameVectorizer.from_dataframe(train_surname_df))

    @classmethod
    # load the dataset together with a previously cached vectorizer
    def load_dataset_and_load_vectorizer(cls, surname_csv, vectorizer_filepath):
        """Load dataset and the corresponding vectorizer. 
        Used in the case the vectorizer has been cached for re-use.
        
        Args:
            surname_csv (str): location of the dataset
            vectorizer_filepath (str): location of the saved vectorizer
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(surname_df, vectorizer)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """a static method for loading the vectorizer from file
        
        Args:
            vectorizer_filepath (str): the location of the serialized vectorizer
        Returns:
            an instance of SurnameVectorizer
        """
        with open(vectorizer_filepath) as fp:
            return SurnameVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        """saves the vectorizer to disk using json
        
        Args:
            vectorizer_filepath (str): the location to save the vectorizer
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)

    def get_vectorizer(self):
        """ returns the vectorizer """
        return self._vectorizer

    def set_split(self, split="train"):
        """ selects the splits in the dataset using a column in the dataframe """
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size
    
    # fetch one data point by index
    def __getitem__(self, index):
        """the primary entry point method for PyTorch datasets
        
        Args:
            index (int): the index to the data point 
        Returns:
            a dictionary holding the data point's:
                features (x_surname)
                label (y_nationality)
        """
        row = self._target_df.iloc[index]

        surname_vector = \
            self._vectorizer.vectorize(row.surname)

        nationality_index = \
            self._vectorizer.nationality_vocab.lookup_token(row.nationality)

        return {'x_surname': surname_vector,
                'y_nationality': nationality_index}

    # number of whole batches in the current split
    def get_num_batches(self, batch_size):
        """Given a batch size, return the number of batches in the dataset
        
        Args:
            batch_size (int)
        Returns:
            number of batches in the dataset
        """
        return len(self) // batch_size

# generator that wraps the DataLoader and moves each batch tensor to the requested device
def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"): 
    """
    A generator function which wraps the PyTorch DataLoader. It will 
      ensure each tensor is on the right device location.
    """
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)

    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = tensor.to(device)
        yield out_data_dict
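
As a quick sanity check (a sketch only, assuming the surname CSV used later, data/surnames/surnames_with_splits.csv, is available), we can build the dataset and draw one batch to inspect its shapes:

# Build the dataset, pick the training split, and draw a single batch.
dataset = SurnameDataset.load_dataset_and_make_vectorizer(
    "data/surnames/surnames_with_splits.csv")
dataset.set_split('train')

# The DataLoader collates the per-sample numpy matrices into tensors.
batch = next(generate_batches(dataset, batch_size=4))
print(batch['x_surname'].shape)      # (4, vocab_size, max_surname_length)
print(batch['y_nationality'].shape)  # (4,)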

3.2 Vocabulary, Vectorizer, and DataLoader

In this example, the Vocabulary and DataLoader are implemented in the same way as in the "surname classification with a multilayer perceptron" task, but the Vectorizer's vectorize() method has been changed to suit the CNN model.
Specifically, each character in the string is mapped to an integer, and those integers are used to build a matrix of one-hot vectors, where each column of the matrix is a different one-hot vector. The main reason is that the Conv1d layers we will use require the data tensor to have the batch on the 0th dimension, channels on the 1st dimension, and features on the 2nd dimension.

Besides switching to a one-hot matrix, we also modify the vectorizer to compute the maximum surname length and save it as max_surname_length.

class Vocabulary(object):
    """Class to process text and extract vocabulary for mapping"""

    def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
        """
        Args:
            token_to_idx (dict): a pre-existing map of tokens to indices
            add_unk (bool): a flag that indicates whether to add the UNK token
            unk_token (str): the UNK token to add into the Vocabulary
        """
        # if no initial mapping is provided, start with an empty dict
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx
        
        # build the reverse mapping from indices to tokens
        self._idx_to_token = {idx: token 
                              for token, idx in self._token_to_idx.items()}
        # whether to add the UNK token, and the UNK token itself
        self._add_unk = add_unk
        self._unk_token = unk_token
        # the UNK index defaults to -1 (meaning no UNK token has been added)
        self.unk_index = -1
        # add the UNK token to the vocabulary if requested
        if add_unk:
            self.unk_index = self.add_token(unk_token) 
        
    # return a dictionary representation that can be serialized
    def to_serializable(self):
        """ returns a dictionary that can be serialized """
        return {'token_to_idx': self._token_to_idx, 
                'add_unk': self._add_unk, 
                'unk_token': self._unk_token}

    @classmethod
    # re-create the Vocabulary from serialized contents
    def from_serializable(cls, contents):
        """ instantiates the Vocabulary from a serialized dictionary """
        return cls(**contents)
    
    # add a token to the mapping dicts and return its index
    def add_token(self, token):
        """Update mapping dicts based on the token.

        Args:
            token (str): the item to add into the Vocabulary
        Returns:
            index (int): the integer corresponding to the token
        """
        try:
            index = self._token_to_idx[token]    # the token already exists: return its index
        except KeyError:
            # the token is new: assign the next index and add it to both mappings
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index
    # add a list of tokens to the vocabulary
    def add_many(self, tokens):
        """Add a list of tokens into the Vocabulary
        
        Args:
            tokens (list): a list of string tokens
        Returns:
            indices (list): a list of indices corresponding to the tokens
        """
        # call add_token for each token and collect the returned indices
        return [self.add_token(token) for token in tokens]
    
    # look up the index of a token, falling back to the UNK index if absent
    def lookup_token(self, token):
        """Retrieve the index associated with the token 
          or the UNK index if token isn't present.
        
        Args:
            token (str): the token to look up 
        Returns:
            index (int): the index corresponding to the token
        Notes:
            `unk_index` needs to be >=0 (having been added into the Vocabulary) 
              for the UNK functionality 
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]
        
    # return the token associated with an index
    def lookup_index(self, index):
        """Return the token associated with the index
        
        Args: 
            index (int): the index to look up
        Returns:
            token (str): the token corresponding to the index
        Raises:
            KeyError: if the index is not in the Vocabulary
        """
        if index not in self._idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self._idx_to_token[index]

    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)

    # number of tokens in the vocabulary
    def __len__(self):
        return len(self._token_to_idx)

class SurnameVectorizer(object):
    """ 协调词汇表并将其投入使用的向量化器 """

    def __init__(self, surname_vocab, nationality_vocab, max_surname_length):
        """
        Args:
            surname_vocab (Vocabulary): maps characters to integers
            nationality_vocab (Vocabulary): maps nationalities to integers
            max_surname_length (int): the length of the longest surname
        """
        self.surname_vocab = surname_vocab
        self.nationality_vocab = nationality_vocab
        self._max_surname_length = max_surname_length

    def vectorize(self, surname):
        """
        Args:
            surname (str): the surname
        Returns:
            one_hot_matrix (np.ndarray): a matrix of one-hot vectors
        """
        one_hot_matrix_size = (len(self.surname_vocab), self._max_surname_length)
        one_hot_matrix = np.zeros(one_hot_matrix_size, dtype=np.float32)

        for position_index, character in enumerate(surname):
            character_index = self.surname_vocab.lookup_token(character)
            one_hot_matrix[character_index][position_index] = 1

        return one_hot_matrix

    @classmethod
    def from_dataframe(cls, surname_df):
        """从数据集DataFrame实例化向量化器
        
        Args:
            surname_df (pandas.DataFrame): 姓氏数据集
        Returns:
            SurnameVectorizer的实例
        """
        surname_vocab = Vocabulary(unk_token="@")
        nationality_vocab = Vocabulary(add_unk=False)
        max_surname_length = 0

        for index, row in surname_df.iterrows():
            max_surname_length = max(max_surname_length, len(row.surname))
            for letter in row.surname:
                surname_vocab.add_token(letter)
            nationality_vocab.add_token(row.nationality)

        return cls(surname_vocab, nationality_vocab, max_surname_length)

    @classmethod
    def from_serializable(cls, contents):
        surname_vocab = Vocabulary.from_serializable(contents['surname_vocab'])
        nationality_vocab = Vocabulary.from_serializable(contents['nationality_vocab'])
        return cls(surname_vocab=surname_vocab, nationality_vocab=nationality_vocab, 
                   max_surname_length=contents['max_surname_length'])

    def to_serializable(self):
        return {
            'surname_vocab': self.surname_vocab.to_serializable(),
            'nationality_vocab': self.nationality_vocab.to_serializable(),
            'max_surname_length': self._max_surname_length
        }
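
Assuming a vectorizer has been built from the training data (as is done in Section 3.4.2 below), here is a quick sketch of what vectorize() produces; the surname is only an example:

# Each column of the returned matrix is the one-hot encoding of one character;
# columns beyond the surname's length stay all-zero.
matrix = vectorizer.vectorize("McMahan")
print(matrix.shape)          # (vocab_size, max_surname_length)
print(matrix.sum(axis=0))    # 1.0 in the first len("McMahan") columns, 0.0 afterwards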


3.3 Reimplementing the SurnameClassifier with Convolutional Networks

The model uses four convolutional layers, each followed by an ELU activation.
The first convolution has initial_num_channels input channels, num_channels output channels, and a kernel size of 3. The second and third convolutions keep num_channels input and output channels with a kernel size of 3 and use a stride of 2 to shrink the feature map. The fourth convolution again uses num_channels input and output channels with a kernel size of 3.
The features extracted by the convolutional layers are then passed through a fully connected layer that maps them to a prediction vector over num_classes classes.

import torch.nn as nn
import torch.nn.functional as F

class SurnameClassifier(nn.Module):
    def __init__(self, initial_num_channels, num_classes, num_channels):
        """
        Args:
            initial_num_channels (int): size of the incoming feature vector
            num_classes (int): size of the output prediction vector
            num_channels (int): constant channel size to use throughout network
        """
        super(SurnameClassifier, self).__init__()

        self.convnet = nn.Sequential(
            nn.Conv1d(in_channels=initial_num_channels,
                      out_channels=num_channels, kernel_size=3),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                      kernel_size=3),
            nn.ELU()
        )
        self.fc = nn.Linear(num_channels, num_classes)

    def forward(self, x_surname, apply_softmax=False):
        """The forward pass of the classifier

        Args:
            x_surname (torch.Tensor): an input data tensor.
                x_surname.shape should be (batch, initial_num_channels,
                                           max_surname_length)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, num_classes)
        """
        features = self.convnet(x_surname).squeeze(dim=2)
        prediction_vector = self.fc(features)

        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)

        return prediction_vector
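
To see why the final squeeze(dim=2) works, trace the sequence length through the four convolutions: with a kernel size of 3, a stride-1 convolution shortens the length by 2, and a stride-2 convolution maps a length L to floor((L-3)/2)+1. Assuming, purely for illustration, a character vocabulary of 80 and a maximum surname length of 17 (hypothetical numbers), the length shrinks 17 → 15 → 7 → 3 → 1, so the convnet output is (batch, num_channels, 1), and squeezing dimension 2 leaves (batch, num_channels) for the linear layer. A dummy-tensor check:

import torch

# Hypothetical sizes, for illustration only.
vocab_size, max_surname_length, num_classes = 80, 17, 18

model = SurnameClassifier(initial_num_channels=vocab_size,
                          num_classes=num_classes,
                          num_channels=256)

dummy = torch.zeros(4, vocab_size, max_surname_length)   # (batch, channels, length)
print(model.convnet(dummy).shape)   # torch.Size([4, 256, 1])
print(model(dummy).shape)           # torch.Size([4, 18])

If the maximum surname length in the data led to a final sequence length other than 1, the squeeze would leave a trailing dimension and the linear layer would not line up; the architecture implicitly assumes the input length collapses to exactly 1.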

3.4 The Training Routine

The training routine is the same as in the "surname classification with a multilayer perceptron" task: instantiate the dataset, the model, the loss function, and the optimizer; iterate over the training split and update the model parameters; iterate over the validation split and measure performance; and repeat until the early-stopping condition is met or the preset maximum number of epochs is reached.

3.4.1 Helper functions, settings and some prep work.

# initialize the dictionary that tracks training state
def make_train_state(args):
    return {'stop_early': False,    # early-stopping flag
            'early_stopping_step': 0,
            'early_stopping_best_val': 1e8,
            'learning_rate': args.learning_rate,
            'epoch_index': 0,
            'train_loss': [],
            'train_acc': [],
            'val_loss': [],
            'val_acc': [],
            'test_loss': -1,
            'test_acc': -1,
            'model_filename': args.model_state_file}
           
# update the training state during training
def update_train_state(args, model, train_state):
    """Handle the training state updates.

    Components:
     - Early Stopping: Prevent overfitting.
     - Model Checkpoint: Model is saved if the model is better

    :param args: main arguments
    :param model: model to train
    :param train_state: a dictionary representing the training state values
    :returns:
        a new train_state
    """

    # Save one model at least
    if train_state['epoch_index'] == 0:
        torch.save(model.state_dict(), train_state['model_filename'])
        train_state['stop_early'] = False

    # Save model if performance improved
    elif train_state['epoch_index'] >= 1:
        loss_tm1, loss_t = train_state['val_loss'][-2:]

        # If loss worsened
        if loss_t >= train_state['early_stopping_best_val']:
            # Update step
            train_state['early_stopping_step'] += 1
        # Loss decreased
        else:
            # Save the best model and remember the best validation loss so far
            if loss_t < train_state['early_stopping_best_val']:
                torch.save(model.state_dict(), train_state['model_filename'])
                train_state['early_stopping_best_val'] = loss_t

            # Reset early stopping step
            train_state['early_stopping_step'] = 0

        # Stop early ?
        train_state['stop_early'] = \
            train_state['early_stopping_step'] >= args.early_stopping_criteria

    return train_state

# compute classification accuracy (as a percentage)
def compute_accuracy(y_pred, y_target):
    _, y_pred_indices = y_pred.max(dim=1)
    n_correct = torch.eq(y_pred_indices, y_target).sum().item()
    return n_correct / len(y_pred_indices) * 100

args = Namespace(
    # Data and path information
    surname_csv="data/surnames/surnames_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch4/cnn",
    # Model hyperparameters
    hidden_dim=100,
    num_channels=256,
    # Training hyperparameters
    seed=1337,
    num_epochs=100,
    early_stopping_criteria=5,
    learning_rate=0.001,
    batch_size=128,
    # Runtime options
    cuda=False,
    reload_from_files=False,
    expand_filepaths_to_save_dir=True,
    catch_keyboard_interrupt=True
)


# expand file paths to include the save directory if requested
if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir,
                                        args.vectorizer_file)

    args.model_state_file = os.path.join(args.save_dir,
                                         args.model_state_file)
    
    print("Expanded filepaths: ")
    print("\t{}".format(args.vectorizer_file))
    print("\t{}".format(args.model_state_file))
    
# Check whether CUDA is available
if not torch.cuda.is_available():
    args.cuda = False

args.device = torch.device("cuda" if args.cuda else "cpu")
    
print("Using CUDA: {}".format(args.cuda))

# set random seeds everywhere for reproducibility
def set_seed_everywhere(seed, cuda):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)

# make sure the target directory exists
def handle_dirs(dirpath):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)
        
# Set seed for reproducibility
set_seed_everywhere(args.seed, args.cuda)

# create the save directory if needed
handle_dirs(args.save_dir)

3.4.2 Initialization

if args.reload_from_files:
    # if reload_from_files is set, load the dataset and the cached vectorizer from checkpoint files
    dataset = SurnameDataset.load_dataset_and_load_vectorizer(args.surname_csv,
                                                              args.vectorizer_file)
else:
    # otherwise, create the dataset and a fresh vectorizer
    dataset = SurnameDataset.load_dataset_and_make_vectorizer(args.surname_csv)
    # save the vectorizer to file
    dataset.save_vectorizer(args.vectorizer_file)
    
# get the dataset's vectorizer
vectorizer = dataset.get_vectorizer()

# create the surname classifier; the vocabulary sizes determine the input channels and the number of classes
classifier = SurnameClassifier(initial_num_channels=len(vectorizer.surname_vocab), 
                               num_classes=len(vectorizer.nationality_vocab),
                               num_channels=args.num_channels)

classifier = classifier.to(args.device)
dataset.class_weights = dataset.class_weights.to(args.device)

# cross-entropy loss, weighted by class frequency
loss_func = nn.CrossEntropyLoss(weight=dataset.class_weights)
# Adam optimizer
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
# learning-rate scheduler: halve the learning rate when the validation loss plateaus
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                           mode='min', factor=0.5,
                                           patience=1)
# initialize the training state
train_state = make_train_state(args)

# progress bars for displaying training progress
epoch_bar = tqdm_notebook(desc='training routine', 
                          total=args.num_epochs,
                          position=0)

dataset.set_split('train')
train_bar = tqdm_notebook(desc='split=train',
                          total=dataset.get_num_batches(args.batch_size), 
                          position=1, 
                          leave=True)
dataset.set_split('val')
val_bar = tqdm_notebook(desc='split=val',
                        total=dataset.get_num_batches(args.batch_size), 
                        position=1, 
                        leave=True)

3.4.3 Training loop and evaluation

During training and validation, different keys are used to pull the inputs and targets out of batch_dict.
The training procedure follows the usual feed-forward recipe: run the forward pass, compute the loss, back-propagate, and update the parameters with the chosen optimizer, repeating until the preset maximum number of epochs is reached or the early-stopping condition is met. A sketch of the corresponding test-set evaluation is given after the loop.

try:
    # start training
    for epoch_index in range(args.num_epochs):
        train_state['epoch_index'] = epoch_index

        # Iterate over the training dataset
        # setup: batch generator, set loss and acc to 0, set train mode on

        dataset.set_split('train')  # switch to the training split
        # generate batches of training data
        batch_generator = generate_batches(dataset, 
                                           batch_size=args.batch_size, 
                                           device=args.device)
        # initialize running loss and accuracy
        running_loss = 0.0
        running_acc = 0.0
        # put the classifier in training mode
        classifier.train()

        for batch_index, batch_dict in enumerate(batch_generator):
            # the training routine is these 5 steps:
            # step 1: zero the gradients
            optimizer.zero_grad()
            # step 2: compute the output (forward pass)
            y_pred = classifier(batch_dict['x_surname'])
            # step 3: compute the loss
            loss = loss_func(y_pred, batch_dict['y_nationality'])
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)
            # step 4: use the loss to produce gradients (backward pass)
            loss.backward()
            # step 5: use the optimizer to take a gradient step
            optimizer.step()
            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)
            # update the progress bar
            train_bar.set_postfix(loss=running_loss, acc=running_acc, 
                            epoch=epoch_index)
            train_bar.update()
            
        # record this epoch's average training loss and accuracy
        train_state['train_loss'].append(running_loss)
        train_state['train_acc'].append(running_acc)

        # Iterate over the validation dataset
        # setup: batch generator, set loss and acc to 0; set eval mode on
        dataset.set_split('val')   # switch to the validation split
        # generate batches of validation data
        batch_generator = generate_batches(dataset, 
                                           batch_size=args.batch_size, 
                                           device=args.device)
        # reset running loss and accuracy
        running_loss = 0.
        running_acc = 0.
        # put the classifier in evaluation mode
        classifier.eval()

        for batch_index, batch_dict in enumerate(batch_generator):
            # compute the output
            y_pred = classifier(batch_dict['x_surname'])
            # compute the loss
            loss = loss_func(y_pred, batch_dict['y_nationality'])
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)
            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)
            # update the progress bar
            val_bar.set_postfix(loss=running_loss, acc=running_acc, 
                            epoch=epoch_index)
            val_bar.update()

        # record this epoch's average validation loss and accuracy
        train_state['val_loss'].append(running_loss)
        train_state['val_acc'].append(running_acc)

        # update the training state and adjust the learning rate based on the validation loss
        train_state = update_train_state(args=args, model=classifier,
                                         train_state=train_state)

        scheduler.step(train_state['val_loss'][-1])

        # break out of the loop if the early-stopping condition is met
        if train_state['stop_early']:
            break

        # reset the progress bars
        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()
except KeyboardInterrupt:
    print("Exiting loop")

3.5 Prediction

3.5.1 Classifying a new surname

Given a surname as a string, the function below first applies the vectorization process and then obtains the model's prediction. Note that we pass the apply_softmax flag, so the result contains probabilities. In the multinomial case, the model's prediction is a distribution over the classes. We use the PyTorch max function to get the best class, i.e. the one with the highest predicted probability.

# predict the nationality from a surname
def predict_nationality(surname, classifier, vectorizer):
    """Predict the nationality from a new surname
    
    Args:
        surname (str): the surname to classify
        classifier (SurnameClassifier): an instance of the classifier
        vectorizer (SurnameVectorizer): the corresponding vectorizer
    Returns:
        a dictionary with the most likely nationality and its probability
    """
    # vectorize the surname, convert it to a torch tensor, and add a batch dimension to match the model's expected input
    vectorized_surname = vectorizer.vectorize(surname)
    vectorized_surname = torch.tensor(vectorized_surname).unsqueeze(0)
    # run the classifier and apply softmax to obtain a probability distribution
    result = classifier(vectorized_surname, apply_softmax=True)
    
    # find the highest probability and its index
    probability_values, indices = result.max(dim=1)
    index = indices.item()

    # map the index back to a nationality via the nationality vocabulary
    predicted_nationality = vectorizer.nationality_vocab.lookup_index(index)
    probability_value = probability_values.item()

    # return the prediction
    return {'nationality': predicted_nationality, 'probability': probability_value}

# make a prediction
# read the surname to classify
new_surname = input("Enter a surname to classify: ")
# run the model on the CPU to make the prediction
classifier = classifier.cpu()
prediction = predict_nationality(new_surname, classifier, vectorizer)
# print the prediction
print("{} -> {} (p={:0.2f})".format(new_surname,
                                    prediction['nationality'],
                                    prediction['probability']))

An example prediction:
[Screenshot of example output]

3.5.2 Retrieving the top-k predictions for a new surname

It is often useful to look at more than just the single best prediction. For example, a standard practice in NLP is to take the k-best predictions and re-rank them with another model. PyTorch provides torch.topk, a convenient way to obtain these predictions.

# predict the k most likely nationalities for a surname
def predict_topk_nationality(surname, classifier, vectorizer, k=5):
    """Predict the top K nationalities from a new surname
    
    Args:
        surname (str): the surname to classify
        classifier (SurnameClassifier): an instance of the classifier
        vectorizer (SurnameVectorizer): the corresponding vectorizer
        k (int): the number of top nationalities to return
    Returns:
        list of dictionaries, each dictionary is a nationality and a probability
    """
    # vectorize the surname, convert it to a torch tensor, and add a batch dimension to match the expected input
    vectorized_surname = vectorizer.vectorize(surname)
    vectorized_surname = torch.tensor(vectorized_surname).unsqueeze(dim=0)
    # run the model and apply softmax to obtain a probability distribution
    prediction_vector = classifier(vectorized_surname, apply_softmax=True)
    probability_values, indices = torch.topk(prediction_vector, k=k)
    
    # returned size is 1,k
    # pull the top-k probabilities and their indices out as NumPy arrays
    probability_values = probability_values[0].detach().numpy()
    indices = indices[0].detach().numpy()
    
    results = []
    for kth_index in range(k):
        nationality = vectorizer.nationality_vocab.lookup_index(indices[kth_index])
        probability_value = probability_values[kth_index]
        results.append({'nationality': nationality, 
                        'probability': probability_value})
    return results

# make a prediction
# read the surname to classify
new_surname = input("Enter a surname to classify: ")
# read the value of k
k = int(input("How many of the top predictions to see? "))
if k > len(vectorizer.nationality_vocab):
    print("Sorry! That's more than the # of nationalities we have.. defaulting you to max size :)")
    k = len(vectorizer.nationality_vocab)
# run the top-k prediction
predictions = predict_topk_nationality(new_surname, classifier, vectorizer, k=k)
# print the predictions
print("Top {} predictions:".format(k))
print("===================")
for prediction in predictions:
    print("{} -> {} (p={:0.2f})".format(new_surname,
                                        prediction['nationality'],
                                        prediction['probability']))

An example prediction:
[Screenshot of example output]
That’s all.
