20161206#cs231n#2. Linear Classifiers, Assignment1--SVM&Softmax


LiuSpark


Category: Machine Learning

Article link: https://blog.csdn.net/SPARKKKK/article/details/53516738


Course website

Linear classification: Support Vector Machine, Softmax

Linear Classifier

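A linear classifier is simply a linear mapping.
Score function:

$f(x_{i},W,b)=Wx_{i}+b$
Here f gives the class scores used to predict the label $y_{i}$, W is called the weights, b is the bias vector, and $x_{i}$ is a column vector.
Here is a simple quoted example: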

In the above equation, we are assuming that the image $x_{i}$ has all of its pixels flattened out to a single column vector of shape [D x 1]. The matrix W (of size [K x D]), and the vector b (of size [K x 1]) are the parameters of the function. In CIFAR-10, $x_{i}$ contains all pixels in the i-th image flattened into a single [3072 x 1] column, W is [10 x 3072] and b is [10 x 1], so 3072 numbers come into the function (the raw pixel values) and 10 numbers come out (the class scores).

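The advantage is that once W and b have been learned from the training set, the training set can be discarded; plugging the test set data into the function then gives the predictions $y_{i}$.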

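A major problem with the linear classifier is that it fits the values of W to the training set very rigidly, which easily leads to misclassification; this is the problem neural networks are meant to solve.
Usually $f(x_{i},W,b)=Wx_{i}+b$ is rewritten as

$f(x_{i},W)=Wx_{i}$
by appending a dimension holding the constant 1 to $x_{i}$ in place of the bias, which simplifies the equation.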

With our CIFAR-10 example, $x_{i}$ is now [3073 x 1] instead of [3072 x 1] - (with the extra dimension holding the constant 1), and $W$ is now [10 x 3073] instead of [10 x 3072]. The extra column in $W$ now corresponds to the bias $b$.
See the course page for a concrete example.

With this change, adding a single dimension is enough to learn only one matrix $W$, instead of learning both the matrix that stores $W$ and the matrix that stores $b$.
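The bias trick described above can be checked numerically. The following is a minimal sketch with hypothetical toy sizes (D = 4, K = 3, random values), not the CIFAR-10 shapes; the variable names are chosen for illustration only.

```python
import numpy as np

# Hypothetical toy sizes (not CIFAR-10): D = 4 features, K = 3 classes.
rng = np.random.default_rng(0)
D, K = 4, 3
W = rng.standard_normal((K, D))   # weights, shape [K x D]
b = rng.standard_normal((K, 1))   # bias, shape [K x 1]
x = rng.standard_normal((D, 1))   # one input as a column vector, shape [D x 1]

# Original form: f = W x + b
scores_with_bias = W.dot(x) + b

# Bias trick: store b as an extra column of W and append a constant 1 to x.
W_ext = np.hstack([W, b])                  # shape [K x (D + 1)]
x_ext = np.vstack([x, np.ones((1, 1))])    # shape [(D + 1) x 1]
scores_bias_trick = W_ext.dot(x_ext)

# Both forms should give the same class scores.
print(np.allclose(scores_with_bias, scores_bias_trick))
```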

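The pixels of an image should go through mean normalization and feature scaling.
Concretely: [0,255]→[-127,127]→[-1, 1]. This seems slightly off, since some values would be lost; the intermediate range should probably be [-128,127].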
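As a rough sketch of these two preprocessing steps, assuming a hypothetical toy array of pixel values and a scaling constant of 128 (both are illustrative choices, not taken from the course code):

```python
import numpy as np

# Hypothetical toy "images": 5 samples with 6 pixel values each, in [0, 255].
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(5, 6)).astype(np.float64)

# Mean normalization: subtract the per-pixel mean computed over the samples.
mean_pixel = X.mean(axis=0)
X_centered = X - mean_pixel

# Feature scaling: divide by a constant so the values land roughly in [-1, 1].
# The constant 128 is an assumption made for this sketch.
X_scaled = X_centered / 128.0

print(X_scaled.min(), X_scaled.max())
```

In practice the mean would be computed on the training set only and then reused for the test set.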

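Loss function

The loss function is also called the cost function or the objective. The smaller the loss, the better the predictions. Two losses commonly used with linear classifiers are introduced below: the Multiclass Support Vector Machine loss and the Softmax classifier.

Multiclass Support Vector Machine (SVM) loss

This is a common way of defining the loss function.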

$s_{j}=f(x_{i};W)_{j}$

$L_{i}=\sum\limits_{j\neq y_{i}}\max(0,s_{j}-s_{y_{i}}+\Delta)$

$L=\frac{1}{N}\sum\limits_{i}\sum\limits_{j\neq y_{i}}[\max(0,f(x_{i};W)_{j}-f(x_{i};W)_{y_{i}}+\Delta)]+\lambda\sum\limits_{k}\sum\limits_{l}W_{k,l}^{2}$

Hinge loss: $\max(0,...)$, where $...$ stands for some expression; it is simply a function that clamps the expression at a threshold of 0 (nothing special, it seems...)
Regularization penalty: $R(W)=\sum\limits_{k}\sum\limits_{l}W_{k,l}^{2}$
Data loss: $\frac{1}{N}\sum\limits_{i}L_{i}$
Regularization loss: $\lambda R(W)$
Margin: $\Delta$, usually set to 1.0
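To make the terms above concrete, here is a minimal sketch of $L_{i}$ and $R(W)$ for a single example. The helper names `svm_loss_single` and `l2_penalty` and the toy scores are hypothetical, chosen only for illustration.

```python
import numpy as np

def svm_loss_single(scores, y, delta=1.0):
    """Multiclass SVM (hinge) loss L_i for one example.

    scores: 1-D array of class scores s_j; y: index of the correct class.
    """
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0  # the sum skips j == y_i
    return margins.sum()

def l2_penalty(W):
    """Regularization penalty R(W): the sum of all squared weights."""
    return np.sum(W * W)

# Toy scores for three classes; class 0 is the correct one, delta = 10.
scores = np.array([13.0, -7.0, 11.0])
loss_i = svm_loss_single(scores, y=0, delta=10.0)
print(loss_i)  # max(0, -7-13+10) + max(0, 11-13+10) = 0 + 8 = 8.0
```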

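SVM classification is directional, as shown by the arrows in the cs231n figure; see also
Zhihu: the answer by 靠靠靠谱
CSDN: SVM, an overview of the support vector machine algorithm
For Binary Support Vector Machines, refer to the explanation in CS231n.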

Softmax Classifier

The Softmax classifier is essentially the logistic regression classifier generalized to the multiclass setting.
First define f as the vector of scores, analogous to the vector s mentioned above.
Softmax function: $f_{j}(z)=\frac{e^{z_{j}}}{\sum_{k}e^{z_{k}}}$
Cross-entropy loss:

$L_{i}=-\ln(\frac{e^{f_{y_{i}}}}{\sum_{j}e^{f_{j}}})\ \ or\ equivalently\ \ L_{i}=-f_{y_{i}}+\ln\sum\limits_{j}e^{f_{j}}$

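The smaller the cross-entropy loss, the better.
The smaller the cross-entropy loss, i.e. the closer it is to 0, the closer $\frac{e^{f_{y_{i}}}}{\sum_{j}e^{f_{j}}}$ is to 1, meaning the softmax classifier assigns a higher probability to the correct class $y_{i}$.

The probability assigned to the correct class is

$P(y_{i}|x_{i};W)=\frac{e^{f_{y_{i}}}}{\sum_{j}e^{f_{j}}}$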
Clearly the exponentials can be very large, so both the numerator and the denominator can become huge; a suitable trick is needed to keep the computation numerically stable.
Since

$\frac{e^{f_{y_{i}}}}{\sum_{j}e^{f_{j}}}=\frac{Ce^{f_{y_{i}}}}{C\sum_{j}e^{f_{j}}}=\frac{e^{f_{y_{i}}+\ln{C}}}{\sum_{j}e^{f_{j}+\ln{C}}}$
we can set $\ln{C}=-\max_{j}{f_{j}}$. The exponents $f_{j}+\ln{C}$ in the numerator and the denominator then have a maximum of 0, which keeps the terms $e^{f_{j}+\ln{C}}$ from becoming too large.

Summed over all classes $k$, the probabilities $P(k|x_{i};W)$ add up to 1.
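The shift by $\ln{C}=-\max_{j}{f_{j}}$ can be sketched as follows. The function name `softmax_stable` and the toy scores are hypothetical; the scores are chosen so that a direct `np.exp` on the largest one would overflow in float64.

```python
import numpy as np

def softmax_stable(f):
    """Softmax of a 1-D score vector using the max-shift trick."""
    shifted = f - np.max(f)          # ln C = -max_j f_j, so the largest exponent is 0
    exp_shifted = np.exp(shifted)    # every term is now in (0, 1]
    return exp_shifted / np.sum(exp_shifted)

# Toy scores: np.exp(789.0) on its own would overflow to inf.
f = np.array([123.0, 456.0, 789.0])
p = softmax_stable(f)
print(p, p.sum())
```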

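Comparing SVM and the Softmax Classifier

Compared with the SVM, the softmax classifier provides a confidence for every class, whereas the SVM only produces raw scores.
$\lambda$ has a large effect on softmax: when $\lambda$ is very small, some probabilities can become very large, while a somewhat larger $\lambda$ tends to bring the class probabilities closer together.
The value of $\lambda$ directly affects the scores and thereby indirectly affects the final probabilities.

$[1,-2,0]\rightarrow[e^{1},e^{-2},e^{0}]=[2.71,0.14,1]\rightarrow[0.7,0.04,0.26]$
Increasing $\lambda$ penalizes $W$ more, which makes the scores smaller ([0.5,-1,0]) and in the end makes the probabilities closer to one another.
$[0.5,-1,0]\rightarrow[e^{0.5},e^{-1},e^{0}]=[1.65,0.37,1]\rightarrow[0.55,0.12,0.33]$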
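The two score vectors above can be pushed through softmax to see the effect directly. This is a small sketch; the helper `softmax` and the variable names are illustrative, and the score vectors are the ones from the text.

```python
import numpy as np

def softmax(f):
    """Softmax of a 1-D score vector (with the max-shift for stability)."""
    e = np.exp(f - np.max(f))
    return e / e.sum()

# Scores before and after halving, as in the example above.
p_weak_reg = softmax(np.array([1.0, -2.0, 0.0]))
p_strong_reg = softmax(np.array([0.5, -1.0, 0.0]))

print(np.round(p_weak_reg, 2))
print(np.round(p_strong_reg, 2))
```

The smaller scores give a flatter distribution: the largest probability drops while the smallest one rises.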

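For the SVM (with $\Delta=1$ and the first class being the correct one), however, the scores [10, -100, -100] and [10, 9, 9] make no difference, because the loss is 0 in both cases.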
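A short sketch of this contrast, assuming class 0 is the correct class; the helper names `svm_loss` and `cross_entropy_loss` are hypothetical:

```python
import numpy as np

def svm_loss(scores, y, delta=1.0):
    """Multiclass SVM loss for one example."""
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0
    return margins.sum()

def cross_entropy_loss(scores, y):
    """Softmax cross-entropy loss for one example: -f_y + ln(sum_j e^{f_j})."""
    shifted = scores - np.max(scores)
    return -shifted[y] + np.log(np.sum(np.exp(shifted)))

a = np.array([10.0, -100.0, -100.0])
b = np.array([10.0, 9.0, 9.0])
for s in (a, b):
    # SVM loss, then cross-entropy loss, both with class 0 as the label.
    print(svm_loss(s, 0), cross_entropy_loss(s, 0))
```

The SVM loss is 0 for both vectors because the margin is already satisfied, while the cross-entropy loss still distinguishes them.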

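Task: explain the linear classifier

A linear classifier uses $f(x_{i},W)=Wx_{i}$ to define a hyperplane for each class that separates the points of different classes onto its two sides. One side of the hyperplane is designated as the positive direction, and the points on the positive side are the ones the classifier predicts as belonging to that class.
The hyperplanes are learned from the training set, and the key quantity is $W$. $W$ can be viewed as a set of templates: each row holds the parameters used to score one class, and the inner products with $x_{i}$, i.e. $f(x_{i},W)$, are the scores of the different classes estimated from those parameters. The class with the highest score is the prediction, and the loss function measures how well the scores agree with the true label.

Compared with kNN, the advantage of a parametric approach like the linear classifier is that the training set does not have to be traversed again at prediction time: once training has produced $W$, the training set can be discarded. Afterwards, a matrix multiplication of the test data with $W$ is enough to estimate the scores.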
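The prediction step can be sketched as a single matrix multiplication followed by an argmax. The weights and test points below are hypothetical toy values (K = 3 classes, D = 2 features plus the constant-1 bias dimension), not learned parameters.

```python
import numpy as np

# Hypothetical "learned" parameters, one row (template) per class.
W = np.array([[ 1.0,  0.0, 0.0],
              [ 0.0,  1.0, 0.0],
              [-1.0, -1.0, 0.5]])          # shape [K x (D + 1)]

# Two test points stored as columns, each with a trailing constant 1.
X_test = np.array([[2.0, -3.0],
                   [0.5, -1.0],
                   [1.0,  1.0]])           # shape [(D + 1) x N_test]

scores = W.dot(X_test)                     # shape [K x N_test]
y_pred = np.argmax(scores, axis=0)         # highest-scoring class per column
print(y_pred)
```

No training data is needed at this point; only `W` is kept.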

How to find the parameters that make the loss as small as possible is the optimization problem.

Assignment1–SVM

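This involves matrix derivatives,
so I looked up some formulas on matrix derivatives; what is used here is the key result from item 7, the derivative of a scalar y with respect to a matrix X.

In the Assignment provided by cs231n (this definition looks odd, but that is how it is written there):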

$L_{ij}=X_i\cdot W_j-X_i\cdot W_{y_{i}}+1$
$X\in N\times D,\ \ W\in D\times C$
$X_i\in 1\times D,\ \ W_j,W_{y_{i}}\in D\times 1$
where D is the dimensionality of the input image after flattening, C is the number of classes, and N is the number of samples in a minibatch.

At initialization,

$dW=[0,0,..,0]$ (where each $0$ is a D×1 column vector)

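When $k=j$ (with $j\neq y_i$) and $L_{ij}>0$:

$\frac{\partial L_{ij}}{\partial W_k}=X_i^T$

When $k=y_i$ and $L_{ij}>0$:

$\frac{\partial L_{ij}}{\partial W_k}=-X_i^T$
For every other column $W_k$ the derivative is 0, and whenever $L_{ij}\le 0$ the hinge $\max(0,L_{ij})$ contributes no gradient at all.
(Here $W_j$ is treated as the variable with respect to which we differentiate.)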

$\begin{split} \frac{\partial L_{ij}}{\partial W}&=\frac{\partial (X_i\cdot W_j-X_i\cdot W_{y_{i}}+1)}{\partial [W_1,W_2,...,W_{Num\_Classes}]}\\&=[0,0,...,X_i^T,...,-X_i^T,...,0] \end{split}$

Accumulate $dW +=\frac{\partial L_{ij}}{\partial W}$ in a loop over all $i$ and $j$ ($i*j$ iterations), which yields the final gradient dW.
See the code on my computer for details.
References
http://blog.csdn.net/zengdong_1991/article/details/51346201
http://blog.csdn.net/yc461515457/article/details/51921607
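The accumulation described above can be sketched with explicit loops. This is not the author's code (which is not shown in this post); it is a minimal version written from the formulas, assuming $\Delta=1$, the shapes $X\in N\times D$ and $W\in D\times C$, and the same `0.5 * reg` regularization convention that softmax.py uses further down. The random toy data and the finite-difference check are illustrative additions.

```python
import numpy as np

def svm_loss_naive(W, X, y, reg):
    """Multiclass SVM loss and gradient with explicit loops (delta = 1).

    W: (D, C) weights; X: (N, D) data; y: (N,) labels; reg: float.
    """
    dW = np.zeros_like(W)
    num_train, num_classes = X.shape[0], W.shape[1]
    loss = 0.0
    for i in range(num_train):
        scores = X[i].dot(W)
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - scores[y[i]] + 1.0   # L_ij
            if margin > 0:                            # hinge is active
                loss += margin
                dW[:, j] += X[i]                      # dL_ij/dW_j = X_i^T
                dW[:, y[i]] -= X[i]                   # dL_ij/dW_{y_i} = -X_i^T
    loss = loss / num_train + 0.5 * reg * np.sum(W * W)
    dW = dW / num_train + reg * W
    return loss, dW

# Toy data (hypothetical sizes): N = 8, D = 5, C = 3.
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3)) * 0.01
X = rng.standard_normal((8, 5))
y = rng.integers(0, 3, size=8)
loss, dW = svm_loss_naive(W, X, y, reg=0.1)

# Centered finite-difference check on a single entry of W.
h = 1e-5
W_plus, W_minus = W.copy(), W.copy()
W_plus[2, 1] += h
W_minus[2, 1] -= h
numeric = (svm_loss_naive(W_plus, X, y, 0.1)[0]
           - svm_loss_naive(W_minus, X, y, 0.1)[0]) / (2 * h)
print(abs(numeric - dW[2, 1]))
```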

Assignment2–Softmax

$W\in D\times C\ \ \ X\in N\times D$
The loss function is defined as

$L_i=-f_{y_i}+\log \sum\limits_j e^{f_j}$
so

$\frac{\partial L_i}{\partial W}=\frac{\partial (-f_{y_i}+\log \sum\limits_j e^{f_j})}{\partial W}$
http://blog.csdn.net/yc461515457/article/details/51924604
http://blog.csdn.net/xieyi4650/article/details/53332988
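The derivative above can be carried one step further. The following is a sketch of the standard derivation, assuming $f_k=X_i\cdot W_k$ with $X_i\in 1\times D$ and $W_k\in D\times 1$; the symbols $p_k$ and the indicator $\mathbb{1}(\cdot)$ are introduced here for convenience and do not appear in the original notes.

```latex
\begin{align}
p_k &= \frac{e^{f_k}}{\sum_j e^{f_j}} \\
\frac{\partial L_i}{\partial f_k}
  &= -\mathbb{1}(k = y_i) + \frac{e^{f_k}}{\sum_j e^{f_j}}
   = p_k - \mathbb{1}(k = y_i) \\
\frac{\partial L_i}{\partial W_k}
  &= \frac{\partial L_i}{\partial f_k}\,\frac{\partial f_k}{\partial W_k}
   = \bigl(p_k - \mathbb{1}(k = y_i)\bigr)\, X_i^T
\end{align}
```

In other words, every column $W_k$ receives $p_k X_i^T$, and the column of the correct class additionally receives $-X_i^T$.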

softmax.py

import numpy as np
from random import shuffle

def softmax_loss_naive(W, X, y, reg):
  """
  Softmax loss function, naive implementation (with loops)

  Inputs have dimension D, there are C classes, and we operate on minibatches
  of N examples.

  Inputs:
  - W: A numpy array of shape (D, C) containing weights.
  - X: A numpy array of shape (N, D) containing a minibatch of data.
  - y: A numpy array of shape (N,) containing training labels; y[i] = c means
    that X[i] has label c, where 0 <= c < C.
  - reg: (float) regularization strength

  Returns a tuple of:
  - loss as single float
  - gradient with respect to weights W; an array of same shape as W
  """
  # Initialize the loss and gradient to zero.
  loss = 0.0
  dW = np.zeros_like(W)

  #############################################################################
  # TODO: Compute the softmax loss and its gradient using explicit loops.     #
  # Store the loss in loss and the gradient in dW. If you are not careful     #
  # here, it is easy to run into numeric instability. Don't forget the        #
  # regularization!                                                           #
  #############################################################################

  scores = X.dot(W)
  num_trains = scores.shape[0]
  # Shift each row so its max is 0; this avoids overflow in np.exp.
  scores_max = np.max(scores, axis=1)
  scores -= scores_max[:, np.newaxis]
  scores_exp = np.exp(scores)
  scores_exp_sum = np.sum(scores_exp, axis=1)
  p = np.zeros(scores.shape)

  # Per-example class probabilities and cross-entropy loss.
  for i in range(num_trains):
    p[i, :] = scores_exp[i, :] / scores_exp_sum[i]
    loss -= np.log(p[i, y[i]])

  # Gradient: outer product of X[i] and p[i], minus X[i] in the correct-class column.
  for i in range(num_trains):
    dW += X[i][:, np.newaxis] * p[i]
    dW[:, y[i]] -= X[i]
  loss /= num_trains
  loss += 0.5 * reg * np.sum(W * W)
  dW = dW / num_trains + reg * W
  #############################################################################
  #                          END OF YOUR CODE                                 #
  #############################################################################

  return loss, dW


def softmax_loss_vectorized(W, X, y, reg):
  """
  Softmax loss function, vectorized version.

  Inputs and outputs are the same as softmax_loss_naive.
  """
  # Initialize the loss and gradient to zero.
  loss = 0.0
  dW = np.zeros_like(W)

  #############################################################################
  # TODO: Compute the softmax loss and its gradient using no explicit loops.  #
  # Store the loss in loss and the gradient in dW. If you are not careful     #
  # here, it is easy to run into numeric instability. Don't forget the        #
  # regularization!                                                           #
  #############################################################################
  scores = X.dot(W)
  num_trains = scores.shape[0]
  # Shift each row so its max is 0; this avoids overflow in np.exp.
  scores_max = np.max(scores, axis=1)
  scores -= scores_max[:, np.newaxis]
  scores_exp = np.exp(scores)
  scores_exp_sum = np.sum(scores_exp, axis=1)
  p = scores_exp / scores_exp_sum[:, np.newaxis]
  # Mean negative log-probability of the correct classes, plus regularization.
  loss = -np.log(p[np.arange(num_trains), y]).sum()
  loss /= num_trains
  loss += 0.5 * reg * np.sum(W * W)
  # dL/dscores = p - one_hot(y); backpropagate through scores = X.dot(W).
  p[np.arange(num_trains), y] -= 1
  dW = X.T.dot(p) / num_trains + reg * W

  #############################################################################
  #                          END OF YOUR CODE                                 #
  #############################################################################

  return loss, dW

Note

  scores_max=np.max(scores, axis=1)
  scores -= scores_max[:, np.newaxis]

Pay attention to the np.newaxis here.
For example:

>>> a=np.arange(12).reshape(3,4)
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> d=np.max(a,axis=1)
>>> d
array([ 3,  7, 11])
>>> a-d[:,np.newaxis]
array([[-3, -2, -1,  0],
       [-3, -2, -1,  0],
       [-3, -2, -1,  0]])
>>> d[:,np.newaxis]
array([[ 3],
       [ 7],
       [11]])
>>> d.reshape(3,1)
array([[ 3],
       [ 7],
       [11]])

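So keep this in mind when vectorizing code in numpy.

Applying .T to a 1-D row vector does not turn it into a column vector; as shown below, use np.newaxis (or reshape, as above) instead.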

>>> b=np.arange(12)
>>> b
array([ 0,  1,  2, ...,  9, 10, 11])
>>> b.T
array([ 0,  1,  2, ...,  9, 10, 11])
>>> b[:,np.newaxis]
array([[ 0],
       [ 1],
       [ 2],
       ..., 
       [ 9],
       [10],
       [11]])
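A few equivalent ways of getting the column shape are sketched below. `reshape(-1, 1)`, indexing with `None`, and the `keepdims=True` argument of `np.max` are standard NumPy features; the variable names are illustrative.

```python
import numpy as np

b = np.arange(12)
print(b.T.shape)                   # (12,): .T is a no-op on a 1-D array

# Three equivalent ways to turn the 1-D array into a (12, 1) column.
col_newaxis = b[:, np.newaxis]
col_none = b[:, None]              # np.newaxis is an alias for None
col_reshape = b.reshape(-1, 1)     # -1 lets numpy infer the row count

# For the row-max subtraction, keepdims=True keeps the reduced axis,
# so the result broadcasts against `a` without any extra reshaping.
a = np.arange(12).reshape(3, 4)
row_max = np.max(a, axis=1, keepdims=True)   # shape (3, 1)
print(a - row_max)
```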

For the code, see the reference links.
