KNN-P5
A distance metric suited to images: L2 distance does not capture the perceived visual differences between images.
Remaining problem, the curse of dimensionality: densely covering the space would require exponentially many training samples; otherwise the nearest neighbor can actually be far away and not very similar to the query.
Linear classifier-P6
The basic building block and the simplest parametric model: f(x,W) = W*x + b
General-purpose: like Lego bricks, these modules can be stacked into larger networks.
But the decision boundary is linear, so even in high dimensions it performs poorly on data that is not linearly separable.
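As a minimal sketch (the sizes here are assumed, CIFAR-10-style), the whole classifier is one matrix-vector product plus a bias:

```python
import numpy as np

# Assumed sizes: 10 classes, 3072-dim flattened images (CIFAR-10-style)
num_classes, dim = 10, 3072
rng = np.random.default_rng(0)
W = rng.normal(scale=1e-3, size=(num_classes, dim))  # one row per class
b = np.zeros(num_classes)

x = rng.normal(size=dim)   # a single flattened image
scores = W @ x + b         # f(x, W) = W*x + b
print(scores.shape)        # (10,): one score per class
```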
Computational graph: compute nodes, chain rule to obtain gradients.
Each W_i (a row of W) acts as a template; the score weighs the input against all of W's templates.
CNN
W is the convolution kernel; equivalently, the kernel can be flattened to 1-D and applied as a dot product.
stride 2: the output is 3×3 (for a 7×7 input with a 3×3 filter).
Zero padding of this kind keeps the output size unchanged;
padding effectively adds some extra (artificial) features at the borders and corners.
Output depth: equals the number of filters.
pooling layer: downsamples spatially, does not change the depth dimension, and needs no zero padding.
(a sliding stride greater than 1 also performs downsampling)
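The spatial sizes above all follow one formula, output = (N - F + 2P)/S + 1. A small helper (my own, not from the lecture) reproduces the stride-2 case and the size-preserving padding case:

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a conv/pool layer: (N - F + 2P) / S + 1."""
    out = (n - f + 2 * pad) / stride + 1
    assert out.is_integer(), "filter does not tile the input evenly"
    return int(out)

print(conv_output_size(7, 3, stride=2))         # 3: 7x7 input, 3x3 filter, stride 2
print(conv_output_size(7, 3, stride=1, pad=1))  # 7: zero padding keeps the size
```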
Activation functions-P13
2. It also makes updates very inefficient: sigmoid outputs are always positive, so the gradients on the weights all share one sign, and each update can only move "horizontally or vertically" (all-positive or all-negative directions), while the actually optimal direction is the blue one in the figure, forcing a zigzag path.
Similarly, tanh still suffers from the vanishing-gradient problem.
ReLU's negative half-axis: dead ReLU; training can go fine at first, then a unit suddenly dies (once its input stays negative, it receives zero gradient and never recovers).
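A quick numerical illustration of both issues (my own sketch): sigmoid's local gradient collapses once the input saturates, while ReLU simply gates the gradient on the sign of its input, which is also why a unit stuck on the negative side stops learning:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, 0.0, 10.0])
grad = sigmoid(x) * (1 - sigmoid(x))  # sigmoid's local gradient
print(grad)                           # ~4.5e-5 at |x| = 10: vanishing gradient

relu_grad = (x > 0).astype(float)     # ReLU passes gradient only where x > 0
print(relu_grad)                      # [0. 0. 1.]: a unit stuck at x < 0 gets no update
```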
data preprocessing
Zero-mean centering (a constant offset in the inputs can lead to suboptimal solutions).
Normalize by the standard deviation (so all features lie in the same range and contribute equally).
For images, this normalization is usually not applied.
The test set must be preprocessed in the same way, using statistics (e.g. the mean) computed from the training set.
The channels are RGB, so the second option subtracts three numbers, the per-channel means, one from each channel.
Likewise, inside a neural network each layer's inputs are easier to learn from when they have zero mean and unit variance.
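A sketch of the per-channel mean subtraction on a toy batch (the N, H, W, C layout and sizes are assumptions for illustration):

```python
import numpy as np

# Toy batch of N HxW RGB images (assumed layout: N, H, W, C)
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 255, size=(8, 4, 4, 3))

channel_mean = X_train.mean(axis=(0, 1, 2))  # 3 numbers, one per RGB channel
X_centered = X_train - channel_mean          # broadcasts over N, H, W

print(channel_mean.shape)                              # (3,)
print(np.allclose(X_centered.mean(axis=(0, 1, 2)), 0)) # True: zero mean per channel
```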
weight initialization
Backpropagation: if every layer's inputs are very small, then, since the gradient on a weight is proportional to its input x, those gradients are also tiny; multiplying many such factors gives almost 0, so essentially nothing gets updated.
The activations shrink layer by layer, and the gradients reaching the upper layers are nearly 0.
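This shrinkage is easy to reproduce (a toy experiment; the layer width and depth are arbitrary choices): with a too-small init the tanh activations collapse toward 0 after a few layers, while a Xavier-style scale of sqrt(1/fan_in) keeps them at a healthy magnitude:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 500))   # a batch of unit-Gaussian inputs

acts_small, acts_xavier = x, x
for _ in range(10):               # 10-layer tanh net, no biases
    W_small = rng.normal(scale=0.01, size=(500, 500))            # too-small init
    W_xavier = rng.normal(scale=np.sqrt(1 / 500), size=(500, 500))  # Xavier scale
    acts_small = np.tanh(acts_small @ W_small)
    acts_xavier = np.tanh(acts_xavier @ W_xavier)

print(acts_small.std())   # collapses toward 0 -> tiny gradients
print(acts_xavier.std())  # stays at a healthy scale
```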
Batch Normalization
What BN does is transform the data toward a unit Gaussian (the input does not have to be exactly Gaussian; approximately is enough).
Keep activations within the Gaussian range we want.
Every layer then has unit-Gaussian activations.
At test time, BN also uses the mean and variance estimated during training.
During training, the mean and variance are computed from each mini-batch; at test time, running_mean (an exponential moving average accumulated during training) is used instead.
import numpy as np

def batchnorm_forward(x, gamma, beta, bn_param):
    """Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features
    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param["mode"]
    eps = bn_param.get("eps", 1e-5)
    momentum = bn_param.get("momentum", 0.9)
    N, D = x.shape
    running_mean = bn_param.get("running_mean", np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get("running_var", np.zeros(D, dtype=x.dtype))
    out, cache = None, None
    if mode == "train":
        sample_mean = x.mean(axis=0)
        sample_var = x.var(axis=0)
        sample_x = (x - sample_mean) / np.sqrt(sample_var + eps)
        out = gamma * sample_x + beta
        # Exponential moving averages, used at test time
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var
        cache = sample_x, gamma, beta, x, sample_var, sample_mean, eps
    elif mode == "test":
        sample_x = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * sample_x + beta
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)
    # Store the updated running means back into bn_param
    bn_param["running_mean"] = running_mean
    bn_param["running_var"] = running_var
    return out, cache
def batchnorm_backward(dout, cache):
    """For this implementation, you should write out a computation graph for
    batch normalization on paper and propagate gradients backward through
    intermediate nodes.
    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.
    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    sample_x, gamma, beta, x, sample_var, sample_mean, eps = cache
    m = x.shape[0]
    dgamma = np.sum(dout * sample_x, axis=0)
    dbeta = np.sum(dout, axis=0)
    dsample_x = dout * gamma
    dvar = np.sum(dsample_x * (x - sample_mean) * (-0.5) * np.power(sample_var + eps, -1.5), axis=0)
    # Note: the dvar term must stay outside the sum over N (a per-feature mean, axis=0)
    dmean = np.sum(dsample_x * -1.0 / np.sqrt(sample_var + eps), axis=0) \
        + dvar * np.mean(-2 * (x - sample_mean), axis=0)
    dx = dsample_x / np.sqrt(sample_var + eps) + dvar * 2 * (x - sample_mean) / m + dmean / m
    return dx, dgamma, dbeta
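A quick sanity check of what the train-mode branch computes (re-implemented inline with gamma = 1, beta = 0): after normalization, each feature column has zero mean and unit variance:

```python
import numpy as np

# Toy batch with nonzero mean and non-unit variance
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
eps = 1e-5

# The normalization step of batchnorm_forward in train mode
x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

print(np.allclose(x_hat.mean(axis=0), 0))            # True: zero mean per feature
print(np.allclose(x_hat.std(axis=0), 1, atol=1e-3))  # True: ~unit variance per feature
```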
Monitoring the training process; hyperparameter tuning
Other optimization methods
Local minima actually occur less often than expected (in high dimensions, saddle points are the more common problem).
+Momentum, why it helps: even at a saddle point or local minimum, where the gradient is tiny or zero, the parameters still carry velocity, so they do not stall there.
Flat minima tend to generalize better; a very sharp minimum may indicate overfitting or some other problem (try enlarging the dataset: around a sharp minimum the objective can then change drastically).
Moreover (for AdaGrad), the update step keeps shrinking over time, because grad_squared is monotonically increasing.
On non-convex functions this can strand it near a local optimum; hence RMSProp, which decays the accumulator instead (though training may still keep slowing down).
This leads to the following form:
Adam, the GOAT: it combines SGD with momentum and RMSProp.
*Start without lr decay: first set a good lr, then watch the loss curve to see where decay would help.
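A compact sketch of the Adam update (defaults as commonly used; the quadratic objective is just a toy): the first moment m plays the role of momentum, the second moment v is RMSProp's decaying grad_squared, and both get bias-corrected for the early steps:

```python
import numpy as np

def adam_step(w, dw, m, v, t, lr=1e-1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) + RMSProp-style scaling (v) + bias correction."""
    m = beta1 * m + (1 - beta1) * dw       # first moment: velocity
    v = beta2 * v + (1 - beta2) * dw ** 2  # second moment: decaying grad_squared
    m_hat = m / (1 - beta1 ** t)           # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimize f(w) = ||w||^2, whose gradient is 2w
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 101):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print((w ** 2).sum())  # decreases from the initial 5.0 toward the minimum at 0
```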
Regularization
Dropout is one way to improve single-model performance; it is applied during the forward pass.
What gets zeroed are the activations: each layer computes the previous activations times a weight matrix to get the next activations; we then just take those activations and set some of them to zero, so the next layer multiplies the partially zeroed activations by another weight matrix to produce the input of the next activation function.
It is normally used in fully connected layers; conv layers might drop entire feature maps randomly (zeroing whole channels rather than individual elements).
Why it works: dropout helps prevent co-adaptation of features.
With vanilla dropout, at test time the layer outputs are multiplied by the keep probability; the inverted-dropout code below divides by p at training time instead, so test time is left unchanged.
*x.shape unpacks the shape tuple into separate integer arguments, because np.random.rand() takes individual dimension sizes rather than a tuple.
How does dropout affect gradients during training? Training takes longer, because at each step only part of the network is updated; the zeroed neurons receive no gradient. But the converged model is more robust.
A more general recipe: during training, add some randomness (perturbations) to the network to prevent overfitting; at test time, average the randomness out, hopefully improving generalization. (BN has this effect too.)
def dropout_forward(x, dropout_param):
    """
    - x: Input data, of any shape
    - dropout_param: A dictionary with the following keys:
      - p: Dropout parameter. We keep each neuron output with probability p.
      - mode: 'test' or 'train'. If the mode is train, then perform dropout;
        if the mode is test, then just return the input.
      - seed: Seed for the random number generator. Passing seed makes this
        function deterministic, which is needed for gradient checking but not
        in real networks.
    Outputs:
    - out: Array of the same shape as x.
    - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
      mask that was used to multiply the input; in test mode, mask is None.
    """
    p, mode = dropout_param["p"], dropout_param["mode"]
    if "seed" in dropout_param:
        np.random.seed(dropout_param["seed"])
    mask = None
    out = None
    if mode == "train":
        # Inverted dropout: divide by p here so test time needs no scaling
        mask = (np.random.rand(*x.shape) < p) / p
        out = mask * x
    elif mode == "test":
        # Test phase: input passes through unchanged
        out = x.copy()
    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)
    return out, cache
def dropout_backward(dout, cache):
    """Backward pass for inverted dropout.
    Inputs:
    - dout: Upstream derivatives, of any shape
    - cache: (dropout_param, mask) from dropout_forward.
    """
    dropout_param, mask = cache
    mode = dropout_param["mode"]
    dx = None
    if mode == "train":
        dx = dout * mask
    elif mode == "test":
        dx = dout
    return dx
DropConnect: randomly set some entries of the weight matrix to 0 (instead of zeroing activations).
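A one-line sketch of the difference (the keep probability p is an assumed value): the mask is applied to W itself rather than to the activations:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
p = 0.5                                # keep probability (assumed)

mask = (rng.random(W.shape) < p) / p   # mask the weights, not the activations
W_dropped = W * mask                   # surviving weights are rescaled by 1/p
print(W_dropped.shape)                 # (4, 4): same shape, some connections zeroed
```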
Transfer learning
A small sample set overfits easily, yet often we simply do not have that many samples, which is where transfer learning comes in.
Code notes
- np.hstack()
Example:
X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
where X_train is an array of shape (49000, 3072).
Now suppose a = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16],[17,18,19,20]]),
i.e. a is an array of shape (5, 4),
and further b = np.hstack([a, np.ones((a.shape[0],1))]).
The result is
array([[ 1., 2., 3., 4., 1.],
[ 5., 6., 7., 8., 1.],
[ 9., 10., 11., 12., 1.],
[13., 14., 15., 16., 1.],
[17., 18., 19., 20., 1.]])
As the output shows, np.hstack() tiles arrays horizontally; the example appends an extra dimension with constant value 1 to the 3072-dimensional data (the bias trick).
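For the same effect, np.column_stack accepts 1-D columns directly, which avoids the explicit (N, 1) reshape:

```python
import numpy as np

a = np.arange(1, 21).reshape(5, 4)
with_bias = np.hstack([a, np.ones((a.shape[0], 1))])

# column_stack promotes the 1-D ones vector to a column automatically
same = np.column_stack([a, np.ones(a.shape[0])])
print(np.array_equal(with_bias, same))  # True
print(with_bias.shape)                  # (5, 5)
```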
Assignments
Assignment1
SVM
Computing gradients
(P8: derive gradients using a computational graph + the chain rule)
Interpretation and analysis of the L1 and L2 regularization terms
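The qualitative difference shows up directly in the penalty gradients (a small sketch; the lambda value is arbitrary): L2's gradient is proportional to the weight, so small weights are barely penalized, while L1's gradient has constant magnitude and keeps pushing small weights toward exactly 0, which encourages sparsity:

```python
import numpy as np

w = np.array([0.01, 0.5, -2.0])
lam = 0.1

grad_l2 = lam * 2 * w        # proportional to w: 0.002, 0.1, -0.4
grad_l1 = lam * np.sign(w)   # constant magnitude: 0.1, 0.1, -0.1

print(grad_l2)
print(grad_l1)
```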
Vectorization
import numpy as np
shape = (3, 5, 6, 7)
print(shape[0])
print(shape[1:])
print(np.prod(shape[1:]))
"""
3
(5, 6, 7)
210
"""