Surname Classification with MLP and CNN: An NLP Practicum

I. The Multilayer Perceptron (MLP)

Principles

The multilayer perceptron (MLP) is regarded as one of the most basic building blocks of a neural network. It is an extension of the perceptron: a perceptron takes a data vector as input and computes a single output value, whereas in an MLP many perceptrons are grouped into layers, so that the output of each layer is a new vector rather than a single value. In PyTorch, as we will see below, this is done simply by setting the number of output features of the Linear layer.
Another important property of the MLP is that it combines multiple layers with a nonlinearity between each pair of layers.

The figure below is a visualization of an MLP with two Linear layers and three stages of representation: the input vector, the hidden vector, and the output vector.

Strengths: through its hidden neurons, whose activations depend on the values of all the inputs, a multilayer perceptron can capture complex interactions between the inputs, and we can easily design hidden nodes to perform arbitrary computation, such as basic logical operations on a pair of inputs. The MLP is a universal approximator: even with a single hidden layer, given enough neurons and the right weights, it can model any function, although actually learning that function is difficult in practice.

For more detail, see this article:

小白笔记:对MLP多层感知机概念、结构、超参数的理解_感知器 超参数-CSDN博客

Implementing the MLP in PyTorch

Having covered the core ideas of the MLP, we now implement it in PyTorch. As described above, the MLP contains an additional layer of computation beyond the simple perceptron. In the example below we use two of PyTorch's Linear modules to demonstrate the idea. The Linear modules are named fc1 and fc2, following the common convention of calling a Linear module a "fully connected layer", or "fc layer" for short.
In addition to these two Linear layers, there is a rectified linear unit (ReLU), which applies a nonlinearity to the output of the first Linear layer before it is fed into the second. Because the layers are applied in sequence, the number of outputs of each layer must equal the number of inputs of the next. The nonlinearity between the two Linear layers is essential: without it, the composition of two Linear layers is mathematically equivalent to a single Linear layer and cannot capture complex patterns.
The implementation of the MLP involves only the forward pass of backpropagation. This is because PyTorch works out the backward pass and the gradient updates automatically from the model definition and the implementation of the forward pass.
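To make the point about the nonlinearity concrete, here is a small check (not from the original post) showing that two stacked Linear layers with no activation in between collapse to a single Linear layer:

import torch
import torch.nn as nn

torch.manual_seed(0)
fc1 = nn.Linear(3, 100)
fc2 = nn.Linear(100, 4)

# Compose the two affine maps by hand: W = W2 @ W1, b = W2 @ b1 + b2
W = fc2.weight @ fc1.weight
b = fc2.weight @ fc1.bias + fc2.bias

x = torch.rand(2, 3)
stacked = fc2(fc1(x))      # two Linear layers, no nonlinearity
collapsed = x @ W.t() + b  # one equivalent Linear layer
print(torch.allclose(stacked, collapsed, atol=1e-5))  # True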

import torch.nn as nn
import torch.nn.functional as F  # import PyTorch's neural-network and functional modules

class MultilayerPerceptron(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):  # input_dim: size of the input vectors; hidden_dim: output size of fc1; output_dim: output size of fc2
        """
        Args:
            input_dim (int): the size of the input vectors
            hidden_dim (int): the output size of the first Linear layer
            output_dim (int): the output size of the second Linear layer
        """
        super(MultilayerPerceptron, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)  # two Linear (fully connected) layers, fc1 and fc2, with shapes (input_dim, hidden_dim) and (hidden_dim, output_dim)

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the MLP

        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim)
        """
        intermediate = F.relu(self.fc1(x_in))
        output = self.fc2(intermediate)  # push the input through fc1, apply ReLU, then feed the result to fc2

        if apply_softmax:
            output = F.softmax(output, dim=1)  # optionally apply softmax; skip this when training with a cross-entropy loss
        return output  # return the output tensor

For the demonstration we use an input dimension of size 3, an output dimension of size 4, and a hidden dimension of size 100. Notice how, in the output of the print statement, the number of units in each layer lines up neatly to produce an output of dimension 4 for an input of dimension 3.

batch_size = 2 # number of samples input at once
input_dim = 3
hidden_dim = 100
output_dim = 4

# Initialize model
mlp = MultilayerPerceptron(input_dim, hidden_dim, output_dim)
print(mlp)

Output:
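If everything is wired correctly, print(mlp) should display the module hierarchy along these lines (the exact formatting can vary slightly across PyTorch versions):

MultilayerPerceptron(
  (fc1): Linear(in_features=3, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=4, bias=True)
)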

We can quickly test the "wiring" of the model by passing some random inputs through it, as in the example below. Because the model has not been trained yet, the outputs are random. Doing this is a useful sanity check before investing time in training.

import torch
def describe(x):
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    print("Values: \n{}".format(x))

x_input = torch.rand(batch_size, input_dim)
describe(x_input)  # print the type, shape/size, and values of the random input tensor

Output:
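Because the tensor is created with torch.rand, the printed values are uniform random draws, but the header lines should read Type: torch.FloatTensor and Shape/size: torch.Size([2, 3]).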


y_output = mlp(x_input, apply_softmax=False)
describe(y_output)

Output:
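The values are again random because the model is untrained, but the shape should be torch.Size([2, 4]): one 4-dimensional prediction vector for each of the two samples in the batch.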

To summarize, an MLP maps one tensor to another through stacked Linear layers, with a nonlinearity between each pair of Linear layers to break the simple linear relationship and allow the model to warp the vector space. In a classification setting, this warping should make the classes linearly separable in the resulting vector space.

II. Surname Classification with an MLP

1. The Surname Dataset

The surname dataset used here collects 10,000 surnames from 18 different nationalities, gathered from several name sources on the internet. The dataset has some notable properties. First, it is quite imbalanced: more than 70% of the raw data consists of Russian surnames, whether due to sampling bias or to the sheer number of Russian surnames. To reduce this over-representation, we subsample from the surnames labeled as Russian. Second, there is a valid and intuitive relationship between nationality and surname orthography (spelling): some spelling variations are strongly tied to the country of origin.
To prepare the dataset, we group the data by nationality and split it into three parts: 70% for a training set, 15% for a validation set, and the final 15% for a test set, so that the class-label distributions are comparable across the three splits.

Example: loading and inspecting the raw surname data

import collections
import numpy as np
import pandas as pd
import re

from argparse import Namespace

args = Namespace(
    raw_dataset_csv="data/surnames/surnames.csv",
    train_proportion=0.7,
    val_proportion=0.15,
    test_proportion=0.15,
    output_munged_csv="data/surnames/surnames_with_splits.csv",
    seed=1337
)

# Read raw data
surnames = pd.read_csv(args.raw_dataset_csv, header=0)

surnames.head()

Output:
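The post does not show the munging step itself; below is a minimal sketch, consistent with the description above (group by nationality, then assign 70%/15%/15% splits), of how surnames_with_splits.csv could be produced:

# Group rows by nationality so each class is split proportionally
by_nationality = collections.defaultdict(list)
for _, row in surnames.iterrows():
    by_nationality[row.nationality].append(row.to_dict())

np.random.seed(args.seed)
final_list = []
for _, item_list in sorted(by_nationality.items()):
    np.random.shuffle(item_list)
    n_train = int(args.train_proportion * len(item_list))
    n_val = int(args.val_proportion * len(item_list))
    # Tag each row with its split
    for item in item_list[:n_train]:
        item['split'] = 'train'
    for item in item_list[n_train:n_train + n_val]:
        item['split'] = 'val'
    for item in item_list[n_train + n_val:]:
        item['split'] = 'test'
    final_list.extend(item_list)

final_surnames = pd.DataFrame(final_list)
final_surnames.to_csv(args.output_munged_csv, index=False)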

2. Vocabulary, Vectorizer, and DataLoader

To classify surnames using their characters, we use a Vocabulary, a Vectorizer, and a DataLoader to transform the surname strings into vectorized minibatches.

from argparse import Namespace
from collections import Counter
import json
import os
import string

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm_notebook

class Vocabulary(object):
    """Class to process text and extract vocabulary for mapping"""

    def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
        """
        Args:
            token_to_idx (dict): a pre-existing map of tokens to indices
            add_unk (bool): a flag that indicates whether to add the UNK token
            unk_token (str): the UNK token to add into the Vocabulary
        """

        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx

        self._idx_to_token = {idx: token 
                              for token, idx in self._token_to_idx.items()}
        
        self._add_unk = add_unk
        self._unk_token = unk_token
        
        self.unk_index = -1
        if add_unk:
            self.unk_index = self.add_token(unk_token) 
        
        
    def to_serializable(self):
        """ returns a dictionary that can be serialized """
        return {'token_to_idx': self._token_to_idx, 
                'add_unk': self._add_unk, 
                'unk_token': self._unk_token}

    @classmethod
    def from_serializable(cls, contents):
        """ instantiates the Vocabulary from a serialized dictionary """
        return cls(**contents)

    def add_token(self, token):
        """Update mapping dicts based on the token.

        Args:
            token (str): the item to add into the Vocabulary
        Returns:
            index (int): the integer corresponding to the token
        """
        try:
            index = self._token_to_idx[token]
        except KeyError:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index
    
    def add_many(self, tokens):
        """Add a list of tokens into the Vocabulary
        
        Args:
            tokens (list): a list of string tokens
        Returns:
            indices (list): a list of indices corresponding to the tokens
        """
        return [self.add_token(token) for token in tokens]

    def lookup_token(self, token):
        """Retrieve the index associated with the token 
          or the UNK index if token isn't present.
        
        Args:
            token (str): the token to look up 
        Returns:
            index (int): the index corresponding to the token
        Notes:
            `unk_index` needs to be >=0 (having been added into the Vocabulary) 
              for the UNK functionality 
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]

    def lookup_index(self, index):
        """Return the token associated with the index
        
        Args: 
            index (int): the index to look up
        Returns:
            token (str): the token corresponding to the index
        Raises:
            KeyError: if the index is not in the Vocabulary
        """
        if index not in self._idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self._idx_to_token[index]

    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)

    def __len__(self):
        return len(self._token_to_idx)
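# A hypothetical sanity check (not in the original post) for the Vocabulary class:
#
#   vocab = Vocabulary(unk_token="@")   # "@" is added first, so unk_index == 0
#   vocab.add_many(list("smith"))       # assigns indices 1..5 to s, m, i, t, h
#   vocab.lookup_token("s")             # -> 1
#   vocab.lookup_token("z")             # -> 0: unseen tokens fall back to the unk index
#   len(vocab)                          # -> 6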

class SurnameVectorizer(object):
    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""
    def __init__(self, surname_vocab, nationality_vocab, max_surname_length):
        """
        Args:
            surname_vocab (Vocabulary): maps characters to integers
            nationality_vocab (Vocabulary): maps nationalities to integers
            max_surname_length (int): the length of the longest surname
        """
        self.surname_vocab = surname_vocab
        self.nationality_vocab = nationality_vocab
        self._max_surname_length = max_surname_length

    def vectorize(self, surname):
        """
        Args:
            surname (str): the surname
        Returns:
            one_hot_matrix (np.ndarray): a matrix of one-hot vectors
        """

        one_hot_matrix_size = (len(self.surname_vocab), self._max_surname_length)
        one_hot_matrix = np.zeros(one_hot_matrix_size, dtype=np.float32)
                               
        for position_index, character in enumerate(surname):
            character_index = self.surname_vocab.lookup_token(character)
            one_hot_matrix[character_index][position_index] = 1
        
        return one_hot_matrix

    @classmethod
    def from_dataframe(cls, surname_df):
        """Instantiate the vectorizer from the dataset dataframe
        
        Args:
            surname_df (pandas.DataFrame): the surnames dataset
        Returns:
            an instance of the SurnameVectorizer
        """
        surname_vocab = Vocabulary(unk_token="@")
        nationality_vocab = Vocabulary(add_unk=False)
        max_surname_length = 0

        for index, row in surname_df.iterrows():
            max_surname_length = max(max_surname_length, len(row.surname))
            for letter in row.surname:
                surname_vocab.add_token(letter)
            nationality_vocab.add_token(row.nationality)

        return cls(surname_vocab, nationality_vocab, max_surname_length)

    @classmethod
    def from_serializable(cls, contents):
        surname_vocab = Vocabulary.from_serializable(contents['surname_vocab'])
        nationality_vocab =  Vocabulary.from_serializable(contents['nationality_vocab'])
        return cls(surname_vocab=surname_vocab, nationality_vocab=nationality_vocab, 
                   max_surname_length=contents['max_surname_length'])

    def to_serializable(self):
        return {'surname_vocab': self.surname_vocab.to_serializable(),
                'nationality_vocab': self.nationality_vocab.to_serializable(), 
                'max_surname_length': self._max_surname_length}
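As a quick smoke test (hypothetical, not in the original post), the vectorizer can be exercised on a toy dataframe; the resulting matrix has one row per vocabulary character and one column per character position:

toy_df = pd.DataFrame({'surname': ['Smith', 'Zhang'],
                       'nationality': ['English', 'Chinese']})
toy_vectorizer = SurnameVectorizer.from_dataframe(toy_df)
matrix = toy_vectorizer.vectorize('Smith')
print(matrix.shape)  # (10, 5): 9 distinct characters plus the "@" unk token; the longest surname has 5 letters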

3. The Surname Classifier Model

SurnameClassifier is an implementation of the MLP introduced earlier in this post. The first Linear layer maps the input vectors to an intermediate vector, and a nonlinearity is applied to that vector. The second Linear layer maps the intermediate vector to the prediction vector.

import torch.nn as nn
import torch.nn.functional as F

class SurnameClassifier(nn.Module):
    """ A 2-layer Multilayer Perceptron for classifying surnames """
    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Args:
            input_dim (int): the size of the input vectors
            hidden_dim (int): the output size of the first Linear layer
            output_dim (int): the output size of the second Linear layer
        """
        super(SurnameClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the classifier

        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim)
        """
        intermediate_vector = F.relu(self.fc1(x_in))
        prediction_vector = self.fc2(intermediate_vector)

        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)

        return prediction_vector

4. The Training Routine

The most significant differences in training concern the kind of outputs the model produces and the loss function used. In the example below, the output is a multi-class prediction vector that can be converted into probabilities.

args = Namespace(
    # Data and Path information
    surname_csv="data/surnames/surnames_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch4/cnn",
    # Model hyper parameters
    hidden_dim=100,
    num_channels=256,
    # Training hyper parameters
    seed=1337,
    learning_rate=0.001,
    batch_size=128,
    num_epochs=100,
    early_stopping_criteria=5,
    dropout_p=0.1,
    # Runtime options
    cuda=False,
    reload_from_files=False,
    expand_filepaths_to_save_dir=True,
    catch_keyboard_interrupt=True
)
# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False

args.device = torch.device("cuda" if args.cuda else "cpu")

class SurnameDataset(Dataset):
    def __init__(self, surname_df, vectorizer):
        """
        Args:
            surname_df (pandas.DataFrame): the dataset
            vectorizer (SurnameVectorizer): vectorizer instantiated from dataset
        """
        self.surname_df = surname_df
        self._vectorizer = vectorizer
        self.train_df = self.surname_df[self.surname_df.split=='train']
        self.train_size = len(self.train_df)

        self.val_df = self.surname_df[self.surname_df.split=='val']
        self.validation_size = len(self.val_df)

        self.test_df = self.surname_df[self.surname_df.split=='test']
        self.test_size = len(self.test_df)

        self._lookup_dict = {'train': (self.train_df, self.train_size),
                             'val': (self.val_df, self.validation_size),
                             'test': (self.test_df, self.test_size)}

        self.set_split('train')
        
        # Class weights
        class_counts = surname_df.nationality.value_counts().to_dict()
        def sort_key(item):
            return self._vectorizer.nationality_vocab.lookup_token(item[0])
        sorted_counts = sorted(class_counts.items(), key=sort_key)
        frequencies = [count for _, count in sorted_counts]
        self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)


    @classmethod
    def load_dataset_and_make_vectorizer(cls, surname_csv):
        """Load dataset and make a new vectorizer from scratch
        
        Args:
            surname_csv (str): location of the dataset
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        train_surname_df = surname_df[surname_df.split=='train']
        return cls(surname_df, SurnameVectorizer.from_dataframe(train_surname_df))

    @classmethod
    def load_dataset_and_load_vectorizer(cls, surname_csv, vectorizer_filepath):
        """Load dataset and the corresponding vectorizer. 
        Used in the case where the vectorizer has been cached for re-use
        
        Args:
            surname_csv (str): location of the dataset
            vectorizer_filepath (str): location of the saved vectorizer
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(surname_df, vectorizer)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """a static method for loading the vectorizer from file
        
        Args:
            vectorizer_filepath (str): the location of the serialized vectorizer
        Returns:
            an instance of SurnameVectorizer
        """
        with open(vectorizer_filepath) as fp:
            return SurnameVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        """saves the vectorizer to disk using json
        
        Args:
            vectorizer_filepath (str): the location to save the vectorizer
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)

    def get_vectorizer(self):
        """ returns the vectorizer """
        return self._vectorizer

    def set_split(self, split="train"):
        """ selects the splits in the dataset using a column in the dataframe """
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        """the primary entry point method for PyTorch datasets
        
        Args:
            index (int): the index to the data point 
        Returns:
            a dictionary holding the data point's features (x_data) and label (y_target)
        """
        row = self._target_df.iloc[index]

        surname_matrix = \
            self._vectorizer.vectorize(row.surname)

        nationality_index = \
            self._vectorizer.nationality_vocab.lookup_token(row.nationality)

        return {'x_surname': surname_matrix,
                'y_nationality': nationality_index}

    def get_num_batches(self, batch_size):
        """Given a batch size, return the number of batches in the dataset
        
        Args:
            batch_size (int)
        Returns:
            number of batches in the dataset
        """
        return len(self) // batch_size

    
def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"):
    """
    A generator function which wraps the PyTorch DataLoader. It ensures
      that each tensor is on the right device.
    """
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)

    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name].to(device)
        yield out_data_dict

dataset = SurnameDataset.load_dataset_and_make_vectorizer(args.surname_csv)  # load the dataset and build a fresh vectorizer from it

vectorizer = dataset.get_vectorizer()  # fetch the dataset's vectorizer

classifier = SurnameClassifier(input_dim=len(vectorizer.surname_vocab),
                               hidden_dim=args.hidden_dim,
                               output_dim=len(vectorizer.nationality_vocab))

classifier = classifier.to(args.device)  # move the classifier to the target device (CPU here)

loss_func = nn.CrossEntropyLoss(dataset.class_weights)  # cross-entropy loss, weighted to counteract the class imbalance
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)  # Adam optimizer for updating the classifier's parameters

Output:
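The post stops at the instantiation step; a minimal sketch of the epoch loop it describes (simplified from the book's full routine, with no early stopping, checkpointing, or accuracy bookkeeping, and assuming the vectorized inputs match the classifier's expected (batch, input_dim) shape) might look like this:

for epoch_index in range(args.num_epochs):
    # Training: iterate over the training split and update the parameters
    dataset.set_split('train')
    classifier.train()
    for batch_dict in generate_batches(dataset, batch_size=args.batch_size,
                                       device=args.device):
        optimizer.zero_grad()
        y_pred = classifier(batch_dict['x_surname'])
        loss = loss_func(y_pred, batch_dict['y_nationality'])
        loss.backward()
        optimizer.step()

    # Validation: iterate over the validation split and measure the loss
    dataset.set_split('val')
    classifier.eval()
    with torch.no_grad():
        for batch_dict in generate_batches(dataset, batch_size=args.batch_size,
                                           device=args.device):
            y_pred = classifier(batch_dict['x_surname'])
            val_loss = loss_func(y_pred, batch_dict['y_nationality'])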

III. Convolutional Neural Networks (CNN)

Principles

A convolutional neural network (CNN) is a kind of neural network that is well suited to detecting spatial substructure (and, as a consequence, creating meaningful spatial substructure). CNNs accomplish this by scanning an input data tensor with a small number of weights. The figure below shows a simple CNN model:

The region outlined in red can be understood as a filter: a neuron carrying a fixed set of weights. Stacking multiple filters forms a convolutional layer.

1. Input layer
The input layer receives the raw image data. An image usually consists of three color channels (red, green, and blue), each a two-dimensional matrix of pixel intensity values.

2. Convolution and activation
Convolutional layers extract features by convolving the input with learned kernels, and introduce nonlinearity by applying an activation function such as ReLU. This step lets the network learn complex features.

3. Pooling layer
Pooling layers shrink the feature maps by taking the maximum or average value within a pooling window, which reduces the computational cost and helps retain the most important features.

4. Stacked layers
A CNN is usually built by stacking several convolutional and pooling layers, extracting progressively higher-level features. Deeper features can represent more complex patterns.

5. Fully connected layers and output
Finally, fully connected layers transform the extracted feature maps into the network's final output: a classification label, a regression value, or the result of some other task.

For more, see the article below:

【深度学习】一文搞懂卷积神经网络(CNN)的原理(超详细)_卷积神经网络原理-CSDN博客

Implementing the CNN in PyTorch

This time we use a CNN instead of an MLP. We still need a final Linear layer, which learns to produce a prediction vector from the feature vector created by a series of convolutional layers; the key, therefore, is configuring the convolutional layers so that they yield the desired feature vector. All CNN applications follow a similar pattern: first extract feature maps with a stack of convolutional layers, then hand the result off to downstream processing. In classification, that downstream processing is usually a Linear (fully connected) layer.

The first step in constructing the feature vector is applying an instance of PyTorch's Conv1d class to a three-dimensional data tensor. By inspecting the size of the output, you can see how much the tensor has been reduced.

import torch.nn as nn
import torch.nn.functional as F
import torch

def describe(x):
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    print("Values: \n{}".format(x))
    

batch_size = 2
one_hot_size = 10     # number of input channels (e.g. the vocabulary size)
sequence_width = 7    # length of the sequence
data = torch.randn(batch_size, one_hot_size, sequence_width)
conv1 = nn.Conv1d(in_channels=one_hot_size, out_channels=16, kernel_size=3)
intermediate1 = conv1(data)           # shape: (2, 16, 5)
conv2 = nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3)
conv3 = nn.Conv1d(in_channels=32, out_channels=64, kernel_size=3)

intermediate2 = conv2(intermediate1)  # shape: (2, 32, 3)
intermediate3 = conv3(intermediate2)  # shape: (2, 64, 1)

print(intermediate2.size())
print(intermediate3.size())

Output:
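With no padding and a stride of 1, each Conv1d shortens the sequence by kernel_size - 1 (in general, L_out = floor((L_in - kernel_size) / stride) + 1), so the two print statements should show torch.Size([2, 32, 3]) and torch.Size([2, 64, 1]).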

 

IV. Surname Classification with a CNN

1. The Surname Dataset

Although the surname dataset was already described in the MLP-based surname classification section, there is one key difference in the CNN implementation: the dataset consists of a matrix of one-hot vectors rather than a single collapsed one-hot vector. To support this, the dataset class tracks the longest surname and supplies that length to the vectorizer as the number of columns in the matrix; the number of rows equals the size of a one-hot vector (that is, the size of the vocabulary).
We use the longest surname in the dataset to set the size of the one-hot matrix for two reasons. First, combining the surname matrices of a minibatch into a single three-dimensional tensor requires them all to have the same shape. Second, using the longest surname lets us handle every minibatch in a uniform way.

class SurnameDataset(Dataset):
    # ... existing implementation from Section 4.2

    def __getitem__(self, index):
        row = self._target_df.iloc[index]

        surname_matrix = \
            self._vectorizer.vectorize(row.surname)  # the vectorizer pads to max_surname_length internally

        nationality_index = \
             self._vectorizer.nationality_vocab.lookup_token(row.nationality)

        return {'x_surname': surname_matrix,
                'y_nationality': nationality_index}

2. Vocabulary, Vectorizer, and DataLoader

This time, the Vectorizer's vectorize() method has been adapted to the needs of the CNN model. Specifically, as the code below shows, the method maps each character in the string to an integer and uses those integers to build a matrix of one-hot vectors. Importantly, each column of the matrix is a separate one-hot vector. The main reason for this is that the Conv1d layers we will use require the data tensor to have the batch size on dimension 0, the channels on dimension 1, and the features on dimension 2.

class SurnameVectorizer(object):
    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""
    def vectorize(self, surname):
        """
        Args:
            surname (str): the surname
        Returns:
            one_hot_matrix (np.ndarray): a matrix of one-hot vectors
        """

        one_hot_matrix_size = (len(self.character_vocab), self.max_surname_length)  # one row per vocabulary character, one column per position
        one_hot_matrix = np.zeros(one_hot_matrix_size, dtype=np.float32)  # start from an all-zero matrix

        for position_index, character in enumerate(surname):  # walk over the characters of the surname
            character_index = self.character_vocab.lookup_token(character)
            one_hot_matrix[character_index][position_index] = 1

        return one_hot_matrix

    @classmethod
    def from_dataframe(cls, surname_df):
        """Instantiate the vectorizer from the dataset dataframe

        Args:
            surname_df (pandas.DataFrame): the surnames dataset
        Returns:
            an instance of the SurnameVectorizer
        """
        character_vocab = Vocabulary(unk_token="@")
        nationality_vocab = Vocabulary(add_unk=False)
        max_surname_length = 0  # track the length of the longest surname
        # iterate over every row of the dataset
        for index, row in surname_df.iterrows():
            max_surname_length = max(max_surname_length, len(row.surname))
            for letter in row.surname:
                character_vocab.add_token(letter)
            nationality_vocab.add_token(row.nationality)
        # return a SurnameVectorizer holding the character vocabulary, the nationality vocabulary, and the maximum surname length
        return cls(character_vocab, nationality_vocab, max_surname_length)

3. Reimplementing the Surname Classifier with a Convolutional Network

import torch.nn as nn
import torch.nn.functional as F

class SurnameClassifier(nn.Module):
    def __init__(self, initial_num_channels, num_classes, num_channels):
        """
        Args:
            initial_num_channels (int): size of the incoming feature vector
            num_classes (int): size of the output prediction vector
            num_channels (int): constant channel size to use throughout network
        """
        super(SurnameClassifier, self).__init__()

        self.convnet = nn.Sequential(
            nn.Conv1d(in_channels=initial_num_channels,
                      out_channels=num_channels, kernel_size=3),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                      kernel_size=3),
            nn.ELU()
        )
        self.fc = nn.Linear(num_channels, num_classes)

    def forward(self, x_surname, apply_softmax=False):
        """The forward pass of the classifier

        Args:
            x_surname (torch.Tensor): an input data tensor.
                x_surname.shape should be (batch, initial_num_channels,
                                           max_surname_length)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, num_classes)
        """
        features = self.convnet(x_surname).squeeze(dim=2)
        prediction_vector = self.fc(features)

        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)

        return prediction_vector
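A note on the squeeze(dim=2) call (my arithmetic, not spelled out in the post): with kernel size 3, no padding, and strides of 1, 2, 2, and 1, the length formula L_out = floor((L_in - 3) / stride) + 1 maps an input of length 17, for example, through 17 -> 15 -> 7 -> 3 -> 1. The convnet is thus sized so that the final feature map has length 1, and squeezing dimension 2 leaves a (batch, num_channels) feature vector for the Linear layer; for a different max_surname_length the final length may not be 1, and the stack would need adjusting.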

4. Training the Network

The training routine consists of the following sequence of operations: instantiate the dataset, instantiate the model, instantiate the loss function, instantiate the optimizer, iterate over the training partition of the dataset while updating the model parameters, iterate over the validation partition while measuring performance, and then repeat the dataset iterations a set number of times.

args = Namespace(
    # Data and Path information
    surname_csv="data/surnames/surnames_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch4/cnn",
    # Model hyper parameters
    hidden_dim=100,
    num_channels=256,
    # Training hyper parameters
    seed=1337,
    learning_rate=0.001,
    batch_size=128,
    num_epochs=100,
    early_stopping_criteria=5,
    dropout_p=0.1,
    # Runtime omitted for space ...
)
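The runtime options and the instantiation step are omitted above; a sketch consistent with the classifier's constructor and the MLP setup earlier (attribute names taken from the snippets above) would be:

dataset = SurnameDataset.load_dataset_and_make_vectorizer(args.surname_csv)
vectorizer = dataset.get_vectorizer()

# in_channels equals the character vocabulary size; the output size is the number of nationalities
classifier = SurnameClassifier(initial_num_channels=len(vectorizer.surname_vocab),
                               num_classes=len(vectorizer.nationality_vocab),
                               num_channels=args.num_channels)
classifier = classifier.to(args.device)

loss_func = nn.CrossEntropyLoss(dataset.class_weights)
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)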

5. Model Evaluation and Prediction

To understand how well the model performs, its performance should be measured both quantitatively and qualitatively.

def predict_nationality(surname, classifier, vectorizer):
    """Predict the nationality from a new surname

    Args:
        surname (str): the surname to classify
        classifier (SurnameClassifier): an instance of the classifier
        vectorizer (SurnameVectorizer): the corresponding vectorizer
    Returns:
        a dictionary with the most likely nationality and its probability
    """
    vectorized_surname = vectorizer.vectorize(surname)  # turn the input surname into its one-hot matrix
    vectorized_surname = torch.tensor(vectorized_surname).unsqueeze(0)  # convert to a PyTorch tensor and add a batch dimension
    result = classifier(vectorized_surname, apply_softmax=True)  # run the classifier and apply softmax to get probabilities

    probability_values, indices = result.max(dim=1)  # take the index with the highest probability (the most likely nationality)
    index = indices.item()

    predicted_nationality = vectorizer.nationality_vocab.lookup_index(index)
    probability_value = probability_values.item()

    return {'nationality': predicted_nationality, 'probability': probability_value}  # return the predicted nationality and its probability
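A hypothetical usage example (the surname here is illustrative, not from the original post):

new_surname = "McMahan"
classifier = classifier.to("cpu")  # keep the model on the CPU, matching the tensor built inside predict_nationality
prediction = predict_nationality(new_surname, classifier, vectorizer)
print("{} -> {} (p={:0.2f})".format(new_surname,
                                    prediction['nationality'],
                                    prediction['probability']))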

Output:
