CS231n Study Notes 3: Linear Classification

dancinglikelink

Article link: https://blog.csdn.net/Chrome_matrix_68/article/details/78428835

We will then cast this as an optimization problem in which we will minimize the loss function with respect to the parameters of the score function.

In other words, training amounts to choosing the parameters of the score function so that the loss function is as small as possible.

Parameterized mapping from images to label scores

Score function maps the raw image pixels to class scores.

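D is the dimensionality of the flattened image (the number of pixel values) and K is the number of classes, so the score function is a mapping

$$f: \mathbb{R}^D \to \mathbb{R}^K$$

Linear classifier:

The simplest possible score function is a linear mapping:

$$f(x_i, W, b) = W x_i + b$$

The input image x_i has all of its pixels flattened out to a single column vector of shape [D x 1].

The parameters are the matrix W (of size [K x D]) and the vector b (of size [K x 1]).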

Note: b is called the bias vector because it influences the output scores without interacting with the actual data x_i.
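The linear score function f(x_i, W, b) = W x_i + b can be sketched in a few lines of NumPy. The sizes and numbers below are a toy illustration (D = 4 pixel values, K = 3 classes), and the helper name `linear_scores` is hypothetical; the image is stored as a 1-D array rather than an explicit [D x 1] column.

```python
import numpy as np

def linear_scores(W, x, b):
    """Return the K class scores f(x) = W x + b for one flattened image."""
    return W @ x + b

# Toy sizes (assumed for illustration): D = 4 pixel values, K = 3 classes.
W = np.array([[0.2, -0.5, 0.1, 2.0],
              [1.5, 1.3, 2.1, 0.0],
              [0.0, 0.25, 0.2, -0.3]])   # shape [K x D], one row per class
b = np.array([1.1, 3.2, -1.2])           # shape [K], one bias per class
x = np.array([56.0, 231.0, 24.0, 2.0])   # flattened image, shape [D]

scores = linear_scores(W, x, b)           # one score per class
predicted_class = int(np.argmax(scores))  # index of the highest score
```

The predicted label is simply the index of the largest score; no training image is consulted at test time.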

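Advantages:

Once training is complete, the training data can be discarded; only the learned parameters need to be kept.

Classifying a test image involves a single matrix multiplication and addition, which is much faster than comparing the test image against the entire training set as kNN does.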

Interpreting a linear classifier

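Each weight expresses how much the classifier for a given class likes or dislikes a particular color channel at a particular pixel position. For the ship class, for example, blue pixels receive a higher score (probably because ships tend to be surrounded by water).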

Analogy of images as high-dimensional points

Each image can be viewed as a point in a high-dimensional space, e.g. each image in CIFAR-10 is a point in the 3072-dimensional space of 32x32x3 pixels.

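Note: the meaning of the parameters

Weights W: changing a row of W rotates the decision line of the corresponding class.

Bias vector b: translates the line. Without b, the input x_i = 0 would always produce a score of zero regardless of the weights, so every line would be forced to pass through the origin.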

Interpretation of linear classifiers as template matching.

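Each row of W acts as a template (or prototype) for one class; classifying an image means taking the inner product of the image with every row of W.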

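Analogy with kNN:

In effect we are still doing nearest neighbor, but we no longer match each test image against every image in the training data and take the label of the closest one.

Instead, we use the training data to learn one template per class (10 templates for 10 classes), match each test image against these 10 templates, and take the label of the best-matching template.

The (negative) inner product replaces the L1 or L2 distance as the measure of closeness.

The color distribution of a template also reflects the training data: for example, the car template is predominantly red, which suggests that the training set contains more red cars than cars of any other color.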

Bias trick

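Append a constant 1 to the end of x_i and append b to W as an extra column, so that the two parameters collapse into a single matrix.

The score function then changes from

$$f(x_i, W, b) = W x_i + b$$

to

$$f(x_i, W) = W x_i$$

where W now has size [K x (D + 1)] and x_i has size [(D + 1) x 1].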
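A quick numerical check of the bias trick, as a sketch with random toy data: appending b to W as an extra column and a constant 1 to x_i should give exactly the same scores as W x_i + b. The names `W_ext` and `x_ext` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 4                              # toy sizes, assumed for illustration
W = rng.standard_normal((K, D))
b = rng.standard_normal(K)
x = rng.standard_normal(D)

# Bias trick: fold b into W as an extra column and extend x with a constant 1.
W_ext = np.hstack([W, b[:, None]])       # shape [K x (D + 1)]
x_ext = np.append(x, 1.0)                # shape [D + 1]

scores_two_params = W @ x + b            # original form
scores_one_param = W_ext @ x_ext         # single-matrix form
```

With this layout only one parameter matrix has to be learned and stored.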

Image data preprocessing

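Normalize the input features: center the data by subtracting the mean of every feature (the mean image computed over the training set), and optionally scale each feature so that its values lie roughly in [-1, 1].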
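A minimal sketch of this preprocessing, assuming raw pixel values in [0, 255] and images stored as rows of a matrix. The function name `center_and_scale` and the divisor 255 are choices made for this example; the mean is computed on the training set only and reused for the test set.

```python
import numpy as np

def center_and_scale(X_train, X_test, scale=255.0):
    """Subtract the per-feature training mean, then divide by a constant.

    With pixels in [0, 255], the centered values lie in [-255, 255],
    so dividing by 255 keeps every feature inside [-1, 1].
    """
    mean_image = X_train.mean(axis=0)    # one mean per pixel position
    return (X_train - mean_image) / scale, (X_test - mean_image) / scale

# Random stand-in data: 50 training and 10 test "images" with 12 pixels each.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 256, size=(50, 12)).astype(np.float64)
X_test = rng.integers(0, 256, size=(10, 12)).astype(np.float64)

X_train_c, X_test_c = center_and_scale(X_train, X_test)
```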

Loss function

The loss function measures the disagreement between the prediction and the ground truth. The higher the loss, the worse the classifier.

Multiclass Support Vector Machine loss

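The Multiclass SVM loss for the i-th example is then formalized as follows:

$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$$

For example, suppose the i-th example receives scores s1, s2, s3 for three classes and its ground-truth class is class 1. For each incorrect class we take its score minus s1 and add the margin Δ (delta). If the correct score s1 exceeds an incorrect score by at least Δ, that term is clamped to 0; otherwise the term contributes s_j - s1 + Δ to the loss.

In summary, the SVM loss function wants the score of the correct class y_i to be larger than the incorrect class scores by at least Δ (delta). If this is not the case, loss accumulates.

Since the scores come from the linear score function, s_j = w_j^T x_i with w_j the j-th row of W, the loss can be rewritten as:

$$L_i = \sum_{j \neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)$$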

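The threshold-at-zero function max(0, -) is often called the hinge loss. Sometimes its square, max(0, -)^2, is used as the loss instead (the squared hinge loss), which penalizes violated margins more strongly (quadratically instead of linearly).

Whether to use the squared or the standard version can be decided by cross-validation.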

If any class has a score inside the red region (or higher), then there will be accumulated loss. Otherwise the loss will be zero.
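The per-example loss L_i = sum over j != y_i of max(0, s_j - s_{y_i} + delta) can be sketched as follows. The function name `svm_loss_single` and the `squared` flag are assumptions made for this illustration; the scores are toy values.

```python
import numpy as np

def svm_loss_single(scores, y, delta=1.0, squared=False):
    """Multiclass SVM (hinge) loss for one example with correct class y."""
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0                     # the correct class is not summed
    if squared:
        margins = margins ** 2           # squared hinge loss variant
    return float(margins.sum())

scores = np.array([13.0, -7.0, 11.0])    # toy scores; class 0 is correct

# max(0, -7 - 13 + 10) + max(0, 11 - 13 + 10) = 0 + 8
loss = svm_loss_single(scores, y=0, delta=10.0)
loss_sq = svm_loss_single(scores, y=0, delta=10.0, squared=True)
```

Class 1 is already more than delta below the correct score and contributes nothing; class 2 is only 2 below, so it violates the margin by 8.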

Regularization

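Suppose some set of weights W achieves zero loss on every example. Then W is not unique: any multiple αW with α > 1 also gives zero loss, because scaling the weights stretches all score differences uniformly. To remove this ambiguity, a regularization penalty R(W) is added to the loss. The most common choice is the squared L2 norm:

$$R(W) = \sum_k \sum_l W_{k,l}^2$$

The full loss is then the data loss, i.e. the average of L_i over all examples, plus the regularization loss:

$$L = \frac{1}{N} \sum_i L_i + \lambda R(W)$$

Expanded (N is the number of examples, and λ is chosen by cross-validation):

$$L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta) \right] + \lambda \sum_k \sum_l W_{k,l}^2$$

In the first term, i ranges over the examples and j over every label except the ground-truth label of example i; each summand is a hinge loss.

In the second term, k ranges over the classes (e.g. K = 10) and l over the dimensions of the flattened image (e.g. D = 3072).

λ is the regularization strength.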

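Advantages of the penalty: it favors smaller and more diffuse weights, which tends to reduce overfitting. Also, because of the penalty, we can never achieve a loss of exactly 0.0 on all examples (that would only be possible in the pathological case W = 0).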
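The sketch below computes the full loss (mean hinge loss plus lambda times the sum of squared weights) on a tiny, hand-made separable dataset. It illustrates that a W with zero data loss keeps zero data loss when multiplied by 3, while the L2 penalty grows and therefore prefers the smaller W. The function name `svm_total_loss` is hypothetical and the bias is omitted for brevity.

```python
import numpy as np

def svm_total_loss(W, X, y, reg, delta=1.0):
    """Return (data_loss, reg_loss) for X: [N x D], y: [N], W: [K x D]."""
    N = X.shape[0]
    scores = X @ W.T                              # [N x K] class scores
    correct = scores[np.arange(N), y][:, None]    # [N x 1] correct-class scores
    margins = np.maximum(0.0, scores - correct + delta)
    margins[np.arange(N), y] = 0.0                # skip j == y_i
    data_loss = margins.sum() / N                 # mean of L_i over examples
    reg_loss = reg * np.sum(W * W)                # lambda * sum_k sum_l W_kl^2
    return float(data_loss), float(reg_loss)

# Two examples, two classes, trivially separable (toy data).
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
W = np.array([[2.0, 0.0], [0.0, 2.0]])

data_1, reg_1 = svm_total_loss(W, X, y, reg=0.1)
data_3, reg_3 = svm_total_loss(3.0 * W, X, y, reg=0.1)
```

Both weight settings classify the data with zero hinge loss, so only the regularization term distinguishes them.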

All we have to do now is to come up with a way to find the weights that minimize the loss.

Practical Considerations

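Note that the exact value of delta has little effect on the loss. What really matters is the magnitude of W, which directly controls the magnitude of the scores and therefore the differences between the scores of different classes. Delta can safely be set to 1.0.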

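The loss for a binary Support Vector Machine is defined as follows:

$$L_i = C \max(0, 1 - y_i w^T x_i) + R(W)$$

where y_i ∈ {-1, 1}. C and λ control the same tradeoff between the data loss and the regularization loss, and they are related through C ∝ 1/λ.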

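Asides: these notes optimize the objective in its primal (unconstrained) form. The loss function is technically not differentiable at the kinks of max(0, -), but a subgradient can be used there. There are other formulations of the multiclass SVM, such as One-Vs-All (OVA) and All-vs-All (AVA). Compared with OVA, you can construct multiclass datasets where the version used in these notes can achieve zero data loss, but OVA cannot.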

Softmax classifier

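The Softmax classifier is the generalization of the binary logistic regression classifier to multiple classes.

Its loss function differs from the SVM's in that the hinge loss is replaced by the cross-entropy loss:

$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$

Here L_i is the loss for example i, j ranges over all classes, and f_{y_i} is the score assigned to the correct class y_i. The expression inside the logarithm is the softmax function; it squashes the scores into values in [0, 1] that sum to one. L_i itself is non-negative and unbounded above.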

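Information theory view

The cross-entropy between a true distribution p and an estimated distribution q is defined as:

$$H(p, q) = -\sum_x p(x) \log q(x)$$

We want the relative entropy (KL divergence) between the true and predicted distributions to be as small as possible, i.e. we want the predicted class probabilities to match the true distribution, which puts all of its mass on the correct class.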
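As a short worked step, assume the true distribution p is one-hot on the correct class y_i and q is the softmax output. The decomposition of cross-entropy into entropy plus KL divergence is a standard identity; it is written out here only as a sketch of why minimizing cross-entropy and minimizing relative entropy coincide in this setting.

```latex
\begin{align}
H(p, q) &= -\sum_x p(x) \log q(x) = H(p) + D_{KL}(p \,\|\, q) \\
p(x) &= \begin{cases} 1 & x = y_i \\ 0 & \text{otherwise} \end{cases}
  \quad\Rightarrow\quad H(p) = 0 \\
H(p, q) &= D_{KL}(p \,\|\, q) = -\log q(y_i)
  = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)
\end{align}
```

The last line is exactly the per-example cross-entropy loss of the Softmax classifier.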

Practical issues: Numeric stability

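Exponentials of large scores can be very large numbers, and dividing by them is numerically unstable. Stability is improved by multiplying the numerator and denominator by a constant C:

$$\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{C e^{f_{y_i}}}{C \sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}$$

A common choice is log C = -max_j f_j.

Every shifted exponent is then <= 0, so every exponential is at most 1 and the largest is exactly 1. Nothing can overflow, and the denominator is at least 1.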
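A minimal sketch of the shift by the maximum score (log C = -max_j f_j). The helper names `stable_softmax` and `cross_entropy_loss` are made up for this example; the scores are deliberately large enough that a naive np.exp(f) would overflow.

```python
import numpy as np

def stable_softmax(f):
    """Softmax with the scores shifted so that the largest one is 0."""
    shifted = f - np.max(f)              # log C = -max_j f_j
    exps = np.exp(shifted)               # every value is in (0, 1]
    return exps / exps.sum()

def cross_entropy_loss(f, y):
    """L_i = -log softmax(f)[y], computed from the shifted scores."""
    shifted = f - np.max(f)
    return float(np.log(np.exp(shifted).sum()) - shifted[y])

f = np.array([123.0, 456.0, 789.0])      # np.exp(789.0) alone would overflow
p = stable_softmax(f)

small = np.array([1.0, -2.0, 0.0])
loss_small = cross_entropy_loss(small, y=0)
```

Shifting the scores does not change the result, because the constant cancels between numerator and denominator.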

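A note on naming

The SVM uses the hinge loss, while the Softmax classifier uses the cross-entropy loss. Strictly speaking, softmax is the name of the squashing function that normalizes the scores, so the term softmax loss is technically a misnomer.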

Differences between SVM and Softmax

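Take a single image x_i as an example. In both cases the goal is to reduce the loss (ideally to 0).

The SVM wants the ground-truth score to exceed every other class score by at least the margin. Once the margin is satisfied the loss is zero, and raising the correct score further makes no difference.

The Softmax classifier wants -log(probability of the correct class) to be as low as possible, i.e. the probability of the correct class to be as high as possible.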

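Softmax also provides a probability for every class, but the word probability belongs in quotation marks: as the regularization strength λ increases, the weights W shrink, and the resulting probabilities become more diffuse.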
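To see why these probabilities depend on the regularization strength, the sketch below halves a toy score vector, which is roughly what happens when a stronger penalty shrinks W by half, and compares the softmax outputs. The scores and the factor 0.5 are illustrative assumptions.

```python
import numpy as np

def softmax(f):
    """Numerically stable softmax over a 1-D score vector."""
    exps = np.exp(f - np.max(f))
    return exps / exps.sum()

scores = np.array([1.0, -2.0, 0.0])

p_original = softmax(scores)             # roughly [0.71, 0.04, 0.26]
p_halved = softmax(0.5 * scores)         # roughly [0.55, 0.12, 0.33]
```

The ranking of the classes is unchanged, but the distribution is flatter, so the values are better read as relative confidences than as calibrated probabilities.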

For Softmax, the data loss can always be reduced further but never reaches 0, whereas the SVM is satisfied as soon as the margins are met.
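A small side-by-side comparison, as a sketch with toy scores and delta = 1: both score vectors below satisfy the SVM margin, so the hinge loss is zero for both, while the cross-entropy loss is positive for both and keeps rewarding a larger gap. The function names are hypothetical.

```python
import numpy as np

def svm_loss(scores, y, delta=1.0):
    """Multiclass hinge loss for one example."""
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0
    return float(margins.sum())

def softmax_loss(scores, y):
    """Cross-entropy loss for one example, using shifted scores."""
    shifted = scores - np.max(scores)
    return float(np.log(np.exp(shifted).sum()) - shifted[y])

barely_ok = np.array([10.0, 9.0, 9.0])   # margin of exactly 1
clearly_ok = np.array([10.0, -2.0, 3.0]) # much larger gap

svm_a, svm_b = svm_loss(barely_ok, 0), svm_loss(clearly_ok, 0)
ce_a, ce_b = softmax_loss(barely_ok, 0), softmax_loss(clearly_ok, 0)
```

The SVM treats the two cases as equally good, whereas the Softmax classifier still assigns a noticeably higher loss to the first one.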

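Vocabulary

Inefficient: not efficient; wasteful of time or computation

Cast: to frame or express a problem in a particular form

Optimization: finding the parameters that minimize (or maximize) an objective

Concrete example: a specific, worked instance

Arguably: it can reasonably be claimed

Foreshadowing: a hint of what is to come

For the sake of: for the purpose of

Monochrome: of a single color

Squashing: compressing values into a smaller range

Prototype: a typical or template example of a class

With the terminology ...: using the vocabulary of ...

Cumbersome: awkward, unwieldy

Normalization: rescaling values to a standard range

Subtracting: taking one quantity away from another

Scale: to multiply by a constant factor; also the range or magnitude of values

Anthropomorphise: to attribute human traits to something non-human

In the sense: in this particular meaning

Yield: to produce or give as a result

Abbreviate: to shorten a word or expression

Clamped: held at a bound, e.g. thresholded at zero

If this is not the case: otherwise

Desired: wanted, intended

Terminology: the set of technical terms of a field

Hinge loss: the loss max(0, -), named after its hinge-like shape

Quadratically: in proportion to the square

Quantifies: expresses as a number

Constraint: a condition that must be satisfied

Uniformly: evenly; by the same amount everywhere

Magnitudes: sizes, absolute values

Regularization penalty: a term added to the loss that discourages large weights

Generalization: extension to a more general case

Diffuse: spread out

Negligible: small enough to be ignored

Pathological: degenerate, abnormal (of an edge case)

Brush over: to treat briefly, skipping the details

Tradeoff: a balance between two competing goals

Arbitrarily: without restriction; by any amount

Reciprocal: one divided by a quantity

Unconstrained: without constraints

Kink: a sharp bend; a point where a function is not differentiable

Differentiable: having a derivative

Subgradient: a generalization of the gradient that also exists at kinks

Scope: the range covered by a topic

Uncalibrated: not adjusted to correspond to true probabilities

Squash: to compress values into a range

Distribution: how probability is spread over the possible outcomes

Peaky: sharply peaked; concentrated on a few values

Scenarios: situations, cases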
