I have three treasures: the first is compassion, the second is frugality, the third is not daring to take the lead in the world. Compassion enables courage; frugality enables generosity; not daring to take the lead in the world enables one to become the chief of all vessels.
Chapter 4 centers on classification and introduces two simple models: a multilayer perceptron and a convolutional neural network. This post covers the first model, again focusing on dissecting the code and filling in details the book leaves out.
The task: given a surname, predict which nationality it belongs to.
The overall structure of the example code closely follows Example 3-1; the main change is in the Classifier module, which now introduces a hidden layer.
import torch.nn as nn
import torch.nn.functional as F

class SurnameClassifier(nn.Module):
    """A 2-layer multilayer perceptron for classifying surnames"""
    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Args:
            input_dim (int): the size of the input vectors
            hidden_dim (int): the output size of the first Linear layer
            output_dim (int): the output size of the second Linear layer
        """
        super(SurnameClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the classifier

        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim)
        """
        intermediate_vector = F.relu(self.fc1(x_in))
        prediction_vector = self.fc2(intermediate_vector)
        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)
        return prediction_vector
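A quick way to sanity-check the shape contract is to run the classifier on random input. The dimensions below (a 5-sample batch, an 80-character vocabulary, 300 hidden units, 18 nationalities) are illustrative assumptions, not values prescribed by the book:

import torch

# hypothetical dimensions, for illustration only
batch_size, vocab_size, n_classes = 5, 80, 18
model = SurnameClassifier(input_dim=vocab_size, hidden_dim=300, output_dim=n_classes)
x = torch.rand(batch_size, vocab_size)    # stand-in for a batch of one-hot vectors
logits = model(x)                         # raw scores, shape (5, 18)
probs = model(x, apply_softmax=True)      # each row now sums to 1
print(logits.shape, probs.sum(dim=1))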
Because the two corpora are structured almost identically, the data-loading module is also largely unchanged. The one difference is that Example 3-1 treats whole words as tokens, whereas here the tokens are the individual letters of a surname, so no split call is needed. Accordingly, the vectorize method of the Vectorizer class must be adjusted; in both cases the output is a collapsed one-hot vector. The two versions are compared below:
# Example 4-2 (assumes: import numpy as np)
def vectorize(self, surname):
    vocab = self.surname_vocab
    one_hot = np.zeros(len(vocab), dtype=np.float32)
    for token in surname:                  # iterate over letters directly
        one_hot[vocab.lookup_token(token)] = 1
    return one_hot

# Example 3-1
def vectorize(self, review):
    vocab = self.review_vocab
    one_hot = np.zeros(len(vocab), dtype=np.float32)
    for token in review.split(" "):        # the difference is here
        one_hot[vocab.lookup_token(token)] = 1
    return one_hot
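"Collapsed" means that letter order and repetition are both discarded: a surname is reduced to the set of letters it contains. A minimal standalone sketch with a hypothetical six-letter vocabulary makes this concrete:

import numpy as np

# hypothetical toy vocabulary: letter -> index
vocab = {'a': 0, 'e': 1, 'l': 2, 'n': 3, 'o': 4, 's': 5}

def vectorize(surname):
    one_hot = np.zeros(len(vocab), dtype=np.float32)
    for token in surname:
        one_hot[vocab[token]] = 1   # repeated letters overwrite the same slot
    return one_hot

print(vectorize("lee"))   # [0. 1. 1. 0. 0. 0.]
print(vectorize("eel"))   # identical: order and counts are lost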
The training set covers 18 nationalities, i.e. 18 classes, but the classes are far from evenly represented. To counter this imbalance, the Dataset class computes a class_weights attribute at initialization, holding one weight per class (the reciprocal of that class's sample count). In this example the weights are fixed; they are not parameters updated during training.
# Class weights
class_counts = surname_df.nationality.value_counts().to_dict()
def sort_key(item):
    return self._vectorizer.nationality_vocab.lookup_token(item[0])
# class_counts.items() yields (nationality, count) tuples; key= passes each
# tuple to sort_key, so the tuples are sorted by the nationality's index in
# nationality_vocab and the weights line up with the class indices used as labels
sorted_counts = sorted(class_counts.items(), key=sort_key)
frequencies = [count for _, count in sorted_counts]
self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)
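The effect of the reciprocal weighting is easiest to see with made-up counts; the numbers below are purely illustrative:

import torch

# hypothetical counts, already ordered by vocabulary index
frequencies = [1000, 250, 50]
class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)
print(class_weights)   # tensor([0.0010, 0.0040, 0.0200]) -- rarer classes weigh more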
Before training starts, each module is instantiated; when the loss function is constructed, the class weights are passed in.
classifier = classifier.to(args.device)
dataset.class_weights = dataset.class_weights.to(args.device)
# Per-class weights are useful on an unbalanced training set; here each
# weight is the reciprocal of the corresponding class's sample count.
loss_func = nn.CrossEntropyLoss(weight=dataset.class_weights)
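One detail worth spelling out: nn.CrossEntropyLoss applies log-softmax internally, so during training the classifier must be fed raw logits, i.e. called with apply_softmax=False (the default). A minimal training-step sketch under that assumption; batch_dict and its keys are assumed to come from the dataset's batch generator:

import torch

optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)

optimizer.zero_grad()
y_pred = classifier(batch_dict['x_surname'])           # raw logits, no softmax
loss = loss_func(y_pred, batch_dict['y_nationality'])  # weighted cross entropy
loss.backward()
optimizer.step()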