CS231n: Assignment 1, softmax

Article link: https://blog.csdn.net/Wangpeiyi9979/article/details/98652124

Preface
The full code is available on GitHub.

Q&A summary

When creating the labels for softmax, we need to build one-hot vectors. How can indexing be used to build them quickly?
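One possible answer, sketched in NumPy so that it runs without PyTorch: index the rows with the label array and the columns with `arange(N)`, which sets one entry per column in a single assignment. The helper name `one_hot` and the toy labels are assumptions made for illustration; the (C, N) layout matches the `Y` used in the code later in this post.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (C, N) one-hot matrix for integer labels of shape (N,)."""
    labels = np.asarray(labels)
    n = labels.shape[0]
    Y = np.zeros((num_classes, n))
    # Integer-array indexing: for every column i, set row labels[i] to 1.
    Y[labels, np.arange(n)] = 1.0
    return Y

# Hypothetical toy labels for 4 samples and 3 classes.
Y = one_hot([2, 0, 1, 2], num_classes=3)
print(Y)
```

The same indexing pattern should carry over to PyTorch tensors, with `torch.zeros` and `torch.arange` in place of the NumPy calls.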

Contents

I. Objective
II. Dataset
III. Method
III. Code:
IV. Experiments
V. References

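I. Objective

Implement a classifier with the softmax loss on the CIFAR-10 dataset, derive the gradient formula, and update the parameters with stochastic gradient descent.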

II. Dataset

The dataset is again CIFAR-10; see here for how to load it.

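III. Method

1. Loss function

The softmax loss for a single sample is defined as:
$l = -\mathbf{y}^T\log(\frac{\exp(W\mathbf{x})}{\mathbf{1}^T\exp(W\mathbf{x})})$
where $\mathbf{y} \in R^C$ is one-hot: its entry at the class of sample $\mathbf{x}$ is 1 and all other entries are 0. $W \in R^{C \times K}$ is the weight matrix and $\mathbf{x} \in R^K$ is the sample.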

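Since the logarithm of the softmax is $W\mathbf{x} - \mathbf{1}\log(\mathbf{1}^T\exp(W\mathbf{x}))$ and $\mathbf{y}^T\mathbf{1} = 1$, the loss simplifies to:
$l = -\mathbf{y}^TW\mathbf{x}+\log(\mathbf{1}^T\exp(W\mathbf{x}))$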
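As a quick sanity check of this simplification, the sketch below evaluates both forms of the loss on random data and confirms that they agree. It uses NumPy rather than PyTorch, and the sizes `C`, `K` and the chosen label are arbitrary toy values, not taken from the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K = 4, 6                          # hypothetical toy sizes
W = rng.normal(size=(C, K))
x = rng.normal(size=K)
y = np.zeros(C)
y[1] = 1.0                           # one-hot label for class 1

s = W @ x                            # class scores Wx
softmax = np.exp(s) / np.exp(s).sum()

loss_definition = -y @ np.log(softmax)              # -y^T log(softmax(Wx))
loss_simplified = -y @ s + np.log(np.exp(s).sum())  # -y^T Wx + log(1^T exp(Wx))

print(loss_definition, loss_simplified)
```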

2. Gradient

Our goal is to compute:
$\frac{\partial{l}}{\partial{W}}$
This is the derivative of a scalar with respect to a matrix, so we use matrix differentials:

$\begin{aligned} dl &= -\mathbf{y}^TdW\mathbf{x}+\frac{\mathbf{1}^T(\exp(W\mathbf{x}) \odot (dW\mathbf{x}))}{\mathbf{1}^T\exp(W\mathbf{x})} \\ &= -\mathbf{y}^TdW\mathbf{x}+\frac{\exp(W\mathbf{x})^T dW\mathbf{x}}{\mathbf{1}^T\exp(W\mathbf{x})} \\ &=(-\mathbf{y}^T+softmax(W\mathbf{x})^T)dW\mathbf{x} \\ &=tr((-\mathbf{y}^T+softmax(W\mathbf{x})^T)dW\mathbf{x}) \\ &= tr(\mathbf{x}(-\mathbf{y}^T+softmax(W\mathbf{x})^T)dW) \end{aligned}$

Comparing with $dl = tr((\frac{\partial{l}}{\partial{W}})^TdW)$, we obtain
$\frac{\partial{l}}{\partial{W}} =(-\mathbf{y}+softmax(W\mathbf{x})) \mathbf{x}^T$
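The derived gradient can be checked numerically. The sketch below compares it against central finite differences of the loss. It is written in NumPy, and the function names, toy sizes, and step size `h` are assumptions chosen for illustration.

```python
import numpy as np

def softmax_loss(W, x, y):
    """Single-sample loss: -y^T W x + log(1^T exp(W x))."""
    s = W @ x
    return -y @ s + np.log(np.exp(s).sum())

def softmax_grad(W, x, y):
    """Analytic gradient from the derivation: (softmax(Wx) - y) x^T."""
    s = W @ x
    p = np.exp(s) / np.exp(s).sum()
    return np.outer(p - y, x)

rng = np.random.default_rng(0)
C, K = 3, 5                          # hypothetical toy sizes
W = rng.normal(size=(C, K))
x = rng.normal(size=K)
y = np.zeros(C)
y[2] = 1.0

# Central finite differences, one entry of W at a time.
h = 1e-5
numeric = np.zeros_like(W)
for c in range(C):
    for k in range(K):
        E = np.zeros_like(W)
        E[c, k] = h
        numeric[c, k] = (softmax_loss(W + E, x, y) - softmax_loss(W - E, x, y)) / (2 * h)

max_abs_diff = np.abs(numeric - softmax_grad(W, x, y)).max()
print(max_abs_diff)
```

A tiny maximum difference indicates that the analytic formula and the numerical estimate agree.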

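3. Adding a regularization term

With a regularization term, taking the squared L2 norm as an example, the loss over a batch of $N$ samples is:

$\frac{1}{N}\sum_{i=1}^Nl_i + \frac{1}{2}\lambda||W||^2$

Differentiating the regularization term shows that the gradient gains an extra term $\lambda W$.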
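The extra term can be verified with the same differential technique used for the gradient above. This is a short worked step, assuming $||W||$ denotes the Frobenius norm so that $||W||^2 = tr(W^TW)$:

$\begin{aligned} d(\frac{1}{2}\lambda||W||^2) &= \frac{1}{2}\lambda \, tr(dW^TW + W^TdW) \\ &= \lambda \, tr(W^TdW) \end{aligned}$

so $\frac{\partial}{\partial W}(\frac{1}{2}\lambda||W||^2) = \lambda W$, and the gradient of the full batch loss is $\frac{1}{N}\sum_{i=1}^N(-\mathbf{y}_i+softmax(W\mathbf{x}_i))\mathbf{x}_i^T + \lambda W$.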

III. Code:

(1) Pure-loop code:
Following the derived formulas, the loss and the gradient can be computed with a loop over the samples.

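import torch

def cal_dw_with_loop(self, X, Y, reg):
    """
    Compute the loss and the gradient with an explicit loop over samples.
    Inputs:
        X(Tensor): (K: 3*32*32+1, N)
        Y(Tensor): (C, N)               # one-hot labels
        reg(float):                     # regularization strength
    Outputs:
        L(float):                       # loss
        dW(Tensor): (C, K)              # gradient with respect to self.W
    """
    L = 0.0
    N = X.size(1)
    dW = torch.zeros_like(self.W)       # self.W has shape (C, K)

    # (1) Accumulate the per-sample loss and gradient
    for i in range(N):
        x = X[:, i].unsqueeze(1)        # (K, 1)
        y = Y[:, i].unsqueeze(1)        # (C, 1)
        scores = self.W.matmul(x)       # (C, 1)
        L += -y.t().matmul(scores).item() + torch.log(torch.sum(torch.exp(scores))).item()
        # (C, 1) * (1, K) broadcasts to the outer product of shape (C, K)
        dW = dW + (-y + torch.softmax(scores, 0)) * x.t()

    # (2) Average over the batch and add the regularization term
    L = L / N + 0.5 * reg * torch.sum(torch.pow(self.W, 2)).item()
    dW = dW / N + reg * self.W

    return L, dW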

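(2) Vectorized code
For a batch of $N$ samples, stacking the per-sample gradients gives the vectorized formula:
$\frac{\partial L}{\partial W} = \frac{-YX^T+softmax(WX)X^T}{N} + \lambda W$
where $Y \in R^{C \times N}, X \in R^{K \times N}, W\in R^{C \times K}$, and the softmax is applied to each column of $WX$.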

def cal_dw_with_vec(self, X, Y, reg):
    """
    Compute the loss and the gradient without a loop over samples.
    Inputs:
        X(Tensor): (K: 3*32*32+1, N)
        Y(Tensor): (C, N)               # one-hot labels
        reg(float):                     # regularization strength
    Outputs:
        L(float):                       # loss
        dW(Tensor): (C, K)              # gradient with respect to self.W
    """
    N = X.size(1)
    scores = self.W.matmul(X)                                   # (C, N)

    # Y * scores keeps only the correct-class score of each sample,
    # which avoids building the (N, N) matrix Y^T W X just to read its diagonal.
    L1 = -torch.sum(Y * scores)
    L2 = torch.sum(torch.log(torch.sum(torch.exp(scores), 0)))  # inner sum has shape (N,)
    L = (L1 + L2).item()
    dW = (torch.softmax(scores, 0) - Y).matmul(X.t())           # (C, K)

    L = L / N + 0.5 * reg * torch.sum(torch.pow(self.W, 2)).item()
    dW = dW / N + reg * self.W
    return L, dW
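A useful check when writing both versions is that they return the same loss and gradient on the same batch. The sketch below does this in NumPy with standalone functions rather than the class methods above. The function names, toy sizes, and random data are assumptions, and `W` is taken to have shape (C, K) as in the formulas.

```python
import numpy as np

def loss_grad_loop(W, X, Y, reg):
    """Per-sample loop: accumulate loss and (softmax(Wx) - y) x^T."""
    N = X.shape[1]
    loss, dW = 0.0, np.zeros_like(W)
    for i in range(N):
        x, y = X[:, i], Y[:, i]
        s = W @ x
        loss += -y @ s + np.log(np.exp(s).sum())
        dW += np.outer(np.exp(s) / np.exp(s).sum() - y, x)
    return loss / N + 0.5 * reg * np.sum(W ** 2), dW / N + reg * W

def loss_grad_vec(W, X, Y, reg):
    """Vectorized form: (softmax(WX) - Y) X^T / N + reg * W."""
    N = X.shape[1]
    S = W @ X                                   # (C, N)
    col_sums = np.exp(S).sum(axis=0)            # (N,)
    P = np.exp(S) / col_sums                    # column-wise softmax
    loss = (-np.sum(Y * S) + np.sum(np.log(col_sums))) / N
    return loss + 0.5 * reg * np.sum(W ** 2), (P - Y) @ X.T / N + reg * W

rng = np.random.default_rng(0)
C, K, N = 3, 5, 8                               # hypothetical toy sizes
W = rng.normal(size=(C, K))
X = rng.normal(size=(K, N))
labels = rng.integers(0, C, size=N)
Y = np.zeros((C, N))
Y[labels, np.arange(N)] = 1.0                   # one-hot labels, one per column

loss_loop, dW_loop = loss_grad_loop(W, X, Y, reg=0.1)
loss_vec, dW_vec = loss_grad_vec(W, X, Y, reg=0.1)
print(loss_loop, loss_vec, np.abs(dW_loop - dW_vec).max())
```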

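IV. Experiments

Because the dataset as loaded directly through PyTorch is not convenient to split into training, validation, and test sets, we use only the training set and the test set: we train on the training set and evaluate on the test set, so the hyperparameters below are selected by test-set accuracy. The procedure is as follows: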

lrs = [1e-2, 1e-3, 1e-4, 1e-5]
reg_strs = [0, 1, 10, 100, 1000]

result = {}

best_lr = None
best_reg = None
best_model = None
best_acc = -1

# Grid search over learning rate and regularization strength
for lr in lrs:
    for reg in reg_strs:
        model = train(lr, reg, 100)
        acc = evaluate(model)
        print("lr:{}; reg:{}; acc:{}".format(lr, reg, acc))
        if acc > best_acc:
            best_acc = acc          # record the new best accuracy
            best_lr = lr
            best_reg = reg
            best_model = model
        result[(lr, reg)] = acc
print("the best: lr:{}; reg:{}; acc:{}".format(best_lr, best_reg, best_acc))
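The `train` and `evaluate` functions are not shown in this post. As a rough illustration of what a minibatch SGD loop for this loss could look like, here is a self-contained NumPy sketch on synthetic 2-D data instead of CIFAR-10. The names `train_sgd`, `accuracy`, and `make_blobs`, the data, and all hyperparameter values are hypothetical stand-ins, not the author's implementation. The scores are shifted by their column maximum before exponentiation, which leaves the loss unchanged but avoids overflow.

```python
import numpy as np

def loss_and_grad(W, X, Y, reg):
    """Batch softmax loss and gradient for W (C, K), X (K, N), one-hot Y (C, N)."""
    N = X.shape[1]
    S = W @ X
    S = S - S.max(axis=0, keepdims=True)       # shift for numerical stability
    log_norm = np.log(np.exp(S).sum(axis=0))   # (N,)
    loss = (-np.sum(Y * S) + log_norm.sum()) / N + 0.5 * reg * np.sum(W ** 2)
    P = np.exp(S - log_norm)                   # column-wise softmax
    return loss, (P - Y) @ X.T / N + reg * W

def train_sgd(X, Y, lr, reg, epochs, batch_size, seed=0):
    """Minibatch SGD starting from W = 0; reshuffles the samples every epoch."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    W = np.zeros((Y.shape[0], X.shape[0]))
    for _ in range(epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            _, dW = loss_and_grad(W, X[:, idx], Y[:, idx], reg)
            W -= lr * dW
    return W

def accuracy(W, X, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    return float(np.mean(np.argmax(W @ X, axis=0) == labels))

def make_blobs(n_per_class, rng):
    """Three Gaussian blobs in 2-D, with a row of ones appended as the bias feature."""
    means = np.array([[4.0, 0.0], [-4.0, 0.0], [0.0, 4.0]])
    labels = np.repeat(np.arange(3), n_per_class)
    feats = means[labels] + rng.normal(size=(labels.size, 2))
    X = np.vstack([feats.T, np.ones(labels.size)])          # (K=3, N)
    Y = np.zeros((3, labels.size))
    Y[labels, np.arange(labels.size)] = 1.0
    return X, Y, labels

rng = np.random.default_rng(0)
X_train, Y_train, y_train = make_blobs(100, rng)
X_test, Y_test, y_test = make_blobs(50, rng)

initial_loss, _ = loss_and_grad(np.zeros((3, 3)), X_train, Y_train, reg=1e-3)
W = train_sgd(X_train, Y_train, lr=0.05, reg=1e-3, epochs=20, batch_size=30)
final_loss, _ = loss_and_grad(W, X_train, Y_train, reg=1e-3)
test_acc = accuracy(W, X_test, y_test)
print(initial_loss, final_loss, test_acc)
```

With all-zero weights every class gets probability 1/3, so the initial loss equals log 3. Training should drive the loss well below that on this easily separable toy data.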