I have three treasures: the first is compassion, the second is frugality, the third is not daring to take the lead in the world. Compassion enables courage; frugality enables generosity; not daring to take the lead in the world enables one to become the chief of all vessels.
Chapter 4 centers on classification and introduces two simple models: a multilayer perceptron and a convolutional neural network. This post covers the first model, again focusing on dissecting the code and filling in details the book leaves out.
The task: given a surname, predict which nationality it belongs to.
The overall structure of the example code closely follows Example 3-1; the main change is in the Classifier module, which now introduces a hidden layer.
import torch.nn as nn
import torch.nn.functional as F

class SurnameClassifier(nn.Module):
    """A 2-layer multilayer perceptron for classifying surnames"""
    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Args:
            input_dim (int): the size of the input vectors
            hidden_dim (int): the output size of the first Linear layer
            output_dim (int): the output size of the second Linear layer
        """
        super(SurnameClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the classifier

        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim)
        """
        intermediate_vector = F.relu(self.fc1(x_in))
        prediction_vector = self.fc2(intermediate_vector)
        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)
        return prediction_vector
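A quick way to sanity-check the shape contract is to run the classifier on random input. The dimensions below (a 5-sample batch, an 80-character vocabulary, 300 hidden units, 18 nationalities) are illustrative assumptions, not values prescribed by the book:

import torch

# hypothetical dimensions, for illustration only
batch_size, vocab_size, n_classes = 5, 80, 18
model = SurnameClassifier(input_dim=vocab_size, hidden_dim=300, output_dim=n_classes)
x = torch.rand(batch_size, vocab_size)    # stand-in for a batch of one-hot vectors
logits = model(x)                         # raw scores, shape (5, 18)
probs = model(x, apply_softmax=True)      # each row now sums to 1
print(logits.shape, probs.sum(dim=1))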
Because the two corpora are structured almost identically, the data-loading module is also largely unchanged. The one difference is that Example 3-1 treats whole words as tokens, whereas here the tokens are the individual letters of a surname, so no split call is needed. Accordingly, the vectorize method of the Vectorizer class must be adjusted; in both cases the output is a collapsed one-hot vector. The two versions are compared below:
# Example 4-2 (assumes: import numpy as np)
def vectorize(self, surname):
    vocab = self.surname_vocab
    one_hot = np.zeros(len(vocab), dtype=np.float32)
    for token in surname:                  # iterate over letters directly
        one_hot[vocab.lookup_token(token)] = 1
    return one_hot

# Example 3-1
def vectorize(self, review):
    vocab = self.review_vocab
    one_hot = np.zeros(len(vocab), dtype=np.float32)
    for token in review.split(" "):        # the difference is here
        one_hot[vocab.lookup_token(token)] = 1
    return one_hot
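"Collapsed" means that letter order and repetition are both discarded: a surname is reduced to the set of letters it contains. A minimal standalone sketch with a hypothetical six-letter vocabulary makes this concrete:

import numpy as np

# hypothetical toy vocabulary: letter -> index
vocab = {'a': 0, 'e': 1, 'l': 2, 'n': 3, 'o': 4, 's': 5}

def vectorize(surname):
    one_hot = np.zeros(len(vocab), dtype=np.float32)
    for token in surname:
        one_hot[vocab[token]] = 1   # repeated letters overwrite the same slot
    return one_hot

print(vectorize("lee"))   # [0. 1. 1. 0. 0. 0.]
print(vectorize("eel"))   # identical: order and counts are lost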
The training set covers 18 nationalities, i.e. 18 classes, but the classes are far from evenly represented. To counter this imbalance, the Dataset class computes a class_weights attribute at initialization, holding one weight per class (the reciprocal of that class's sample count). In this example the weights are fixed; they are not parameters updated during training.
# Class weights
class_counts = surname_df.nationality.value_counts().to_dict()
def sort_key(item):
    return self._vectorizer.nationality_vocab.lookup_token(item[0])
# class_counts.items() yields (nationality, count) tuples; key= passes each
# tuple to sort_key, so the tuples are sorted by the nationality's index in
# nationality_vocab and the weights line up with the class indices used as labels
sorted_counts = sorted(class_counts.items(), key=sort_key)
frequencies = [count for _, count in sorted_counts]
self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)
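The effect of the reciprocal weighting is easiest to see with made-up counts; the numbers below are purely illustrative:

import torch

# hypothetical counts, already ordered by vocabulary index
frequencies = [1000, 250, 50]
class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)
print(class_weights)   # tensor([0.0010, 0.0040, 0.0200]) -- rarer classes weigh more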
Before training starts, each module is instantiated; when the loss function is constructed, the class weights are passed in.
classifier = classifier.to(args.device)
dataset.class_weights = dataset.class_weights.to(args.device)
# Per-class weights are useful on an unbalanced training set; here each
# weight is the reciprocal of the corresponding class's sample count.
loss_func = nn.CrossEntropyLoss(weight=dataset.class_weights)
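One detail worth spelling out: nn.CrossEntropyLoss applies log-softmax internally, so during training the classifier must be fed raw logits, i.e. called with apply_softmax=False (the default). A minimal training-step sketch under that assumption; batch_dict and its keys are assumed to come from the dataset's batch generator:

import torch

optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)

optimizer.zero_grad()
y_pred = classifier(batch_dict['x_surname'])           # raw logits, no softmax
loss = loss_func(y_pred, batch_dict['y_nationality'])  # weighted cross entropy
loss.backward()
optimizer.step()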