softmax函数简介与符号说明
softmax函数适用于处理多分类问题,应用广泛的逻辑函数就是softmax函数在二分类情形下的特例。softmax函数将一个n维的输入向量映射为n维的向量,使得输出向量的各元素取值在0到1之间,且所有元素之和为1,即所得到的向量可以作为事件发生的概率。记函数的输入向量为:$Z = (z_1,z_2,\cdots,z_n)^\top$,则函数值为:
$$softmax(X) =(\frac{e^{x_1}}{\sum_{i=1}^{n}e^{x_i}},\frac{e^{x_2}}{\sum_{i=1}^{n}e^{x_i}},\cdots,\frac{e^{x_n}}{\sum_{i=1}^{n}e^{x_i}})^\top$$
对一个激活函数为softmax的单个神经元,记输入数据是由对n个特征进行m次观测所得的样本:
$$
X=\left[
\begin{matrix}
x_{10} & x_{11} & x_{12} & \cdots & x_{1n} \\
x_{20} & x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{m0} & x_{m1} & x_{m2} & \cdots & x_{mn} \\
\end{matrix}
\right]
$$
其中,$\mathbf{X}$第一列数据全部为1,是人为加入的bias项,表示模型带有偏置(即考虑截距项)。$\mathbf{X}_{ij}(1\leq i\leq m,1\leq j\leq n)$表示第j个变量在第i次观测时的值。
该样本对应的真实类别为:
$$
Y=\left[
\begin{matrix}
y_{11} & y_{12} & \cdots & y_{1k} \\
y_{21} & y_{22} & \cdots & y_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
y_{m1} & y_{m2} & \cdots & y_{mk} \\
\end{matrix}
\right]
$$
Y的每一行只有一个值为1,其它全为0。$y_{ij}=1 (1\leq i\leq m,1\leq j\leq k)$表示第i个样本对应的类别为第j类。
记待估计的参数为
$$
\Omega = \left[
\begin{matrix}
\omega_{01} & \omega_{02} & \cdots & \omega_{0k} \\
\omega_{11} & \omega_{12} & \cdots & \omega_{1k} \\
\omega_{21} & \omega_{22} & \cdots & \omega_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
\omega_{n1} & \omega_{n2} & \cdots & \omega_{nk} \\
\end{matrix}
\right]
$$
记softmax函数的自变量为:
$$
Z = \left[
\begin{matrix}
z_{11} & z_{12} & \cdots & z_{1k} \\
z_{21} & z_{22} & \cdots & z_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
z_{m1} & z_{m2} & \cdots & z_{mk} \\
\end{matrix}
\right]
$$
Z是模型输入数据的加权求和,即$z_{ij} = \sum_{k=0}^{n}x_{ik}\omega_{kj}$,$Z = X\Omega$。
记$$
\hat{Y} = softmax(Z) =
\left[
\begin{matrix}
\hat{y}_{11} & \hat{y}_{12} & \cdots & \hat{y}_{1k} \\
\hat{y}_{21} & \hat{y}_{22} & \cdots & \hat{y}_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{y}_{m1} & \hat{y}_{m2} & \cdots & \hat{y}_{mk} \\
\end{matrix}
\right]
$$
其中$\hat{y}_{ij} = \frac{e^{z_{ij}}}{\sum_{p=1}^{k}e^{z_{ip}}}$,表示模型眼中第i个样本属于第j类的概率。
定义模型的总代价函数为
$$
COST(X) = \sum_{i=0}^{m}(-\sum_{j=1}^{k}y_{ij}\ln\hat{y}_{ij})
$$
将代价函数视为参数$\Omega$的函数$J_{\Omega}$,这就是我们要优化的目标。
使用梯度下降求解目标函数极小值
由链式法则可知,\begin{equation}
\frac{\partial J_\Omega}{\partial \omega_{pj}} = \sum_{i=1}^{m} \frac{\partial J_\Omega}{\partial z_{ij}}\frac{\partial z_{ij}}{\partial \omega_{pj}}
\end{equation}
而
\begin{equation}
\frac{\partial J_\Omega}{\partial z_{ip}}=\frac{\partial \sum_{q=0}^{m}(-\sum_{j=1}^{k}\ln\hat{y}_{qj})}{\partial z_{ip}}= -\sum_{j=1}^{k}\frac{{y}_{ij}}{\hat{y}_{ij}}\frac{\partial\hat{y}_{ij}}{\partial z_{ip}}
\end{equation}
由于
\begin{equation}
\frac{\partial\hat{y}_{ij}}{\partial z_{ip}}=\left\{
\begin{aligned}
& =\frac{e^{z_{ij}}}{\sum_{q=1}^{k}e^{z_{iq}}}-\frac{e^{z_{ij}}e^{z_{ip}}}{(\sum_{q=1}^{k}e^{z_{iq}})^2}=\hat{y}_{ip}-\hat{y}_{ip}^2 &j=p \\
& =-\frac{e^{z_{ij}}e^{z_{ip}}}{(\sum_{q=1}^{k}e^{z_{iq}})^2}=-\hat{y}_{ij}\hat{y_ip} &j\neq p \\
\end{aligned}
\right.
\end{equation}
将(3)式带入(2)式,由于$\sum_{j=1}^{k}y_{ij}=1$,有
\begin{equation}
\frac{\partial J_\Omega}{\partial z_{ip}}=\hat{y}_{ip}-y_{ip}
\end{equation}
考虑$z_{ip}=\sum_{q=0}^{n}x_{iq}\omega_{qp}$,并将(4)式带入(1)式,得到:
$$
\frac{\partial J_\Omega}{\partial \omega_{pj}}=\sum_{i=1}^{m}\frac{\partial J_\Omega}{\partial z_{ij}}x_{ip}=\sum_{i=1}^{m}(\hat{y}_{ij}-y_{ij})x_{ip}
$$
写回矩阵的形式,有:
$$
\nabla J_\Omega = \left[
\begin{matrix}
\frac{J_\Omega}{\omega_{01}} &\frac{J_\Omega}{\omega_{02}} & \cdots & \frac{J_\Omega}{\omega_{0k}} \\
\frac{J_\Omega}{\omega_{11}} &\frac{J_\Omega}{\omega_{12}} & \cdots & \frac{J_\Omega}{\omega_{1k}} \\
\vdots & \vdots & \ddots &\vdots \\
\frac{J_\Omega}{\omega_{n1}} &\frac{J_\Omega}{\omega_{n2}} & \cdots & \frac{J_\Omega}{\omega_{nk}} \\
\end{matrix}
\right]=\left[
\begin{matrix}
\sum_{i=1}^{m}x_{i0}(\hat{y}_{i1}-y_{i1}) & \sum_{i=1}^{m}x_{i0}(\hat{y}_{i2}-y_{i2}) & \cdots & \sum_{i=1}^{m}x_{i0}(\hat{y}_{ik}-y_{ik})\\
\sum_{i=1}^{m}x_{i1}(\hat{y}_{i1}-y_{i1}) & \sum_{i=1}^{m}x_{i1}(\hat{y}_{i2}-y_{i2}) & \cdots & \sum_{i=1}^{m}x_{i1}(\hat{y}_{ik}-y_{ik})\\
\vdots & \vdots & \ddots & \cdots \\
\sum_{i=1}^{m}x_{in}(\hat{y}_{i1}-y_{i1}) & \sum_{i=1}^{m}x_{in}(\hat{y}_{i2}-y_{i2}) & \cdots & \sum_{i=1}^{m}x_{in}(\hat{y}_{ik}-y_{ik})\\
\end{matrix}
\right]
$$
梯度已经求出来了,指定步长使用梯度下降方法求解即可。
代码如下。此文档的latex文件在这里。
import numpy as np import matplotlib.pyplot as plt ##使用iris数据作为测试数据,需要将数据文件'iris.data'放置在此文件的目录下 iris = [] target = [] with open('iris.data') as f: for line in f.readlines(): iris.append(line.strip().split(',')[0:4]) target.append(['Iris-setosa'==line.strip().split(',')[4], 'Iris-versicolor'==line.strip().split(',')[4],'Iris-virginica'==line.strip().split(',')[4]]) iris = np.array(iris).astype(float) target=np.array(target) def softmax(x_train,y_train): cost_trend = [] alpha = 3*10**(-3) max_iter = 3*10**4 tol = 0.05 x_train = np.hstack((np.ones(x_train.shape[0]).reshape(x_train.shape[0],1),x_train)) M,N = x_train.shape K = y_train.shape[1] omg = np.random.rand(N*K).reshape(N,K) for i in range(max_iter): grad = x_train.T.dot(np.exp(x_train@omg)/(np.exp(x_train@omg).sum(axis=1)).reshape(M,1) - y_train) omg -= alpha*grad cost = -(np.log(np.exp(x_train@omg) / (np.exp(x_train@omg).sum(axis=1).reshape(M,1))) * y_train).sum() if i in range(0,max_iter,200): cost_trend.append([i,cost]) if cost/M <tol: return omg,cost,cost_trend print('达到最大迭代步数,但模型尚未收敛到指定精度') return omg,cost,cost_trend def pred(x,omg): x = np.hstack((np.ones(x.shape[0]).reshape(x.shape[0],1),x)) return np.exp(x@omg)/np.exp(x@omg).sum(axis=1).reshape(x.shape[0],1) omg,meanerror,error = softmax(iris,target) pred = pred(iris,omg) print('模型在测试数据上的错误率为{:.2f}%'.format(float(sum(abs(pred.argmax(axis=1)-target.argmax(axis=1)))/150)*100)) import matplotlib.pyplot as plt plt.plot([item[0] for item in error[2:]],[item[1]for item in error[2:]]) plt.title('cost-iteration')
结果如下图: