CNN Notes
- detection: bounding box
- segmentation: pixel by pixel
Convolution Layer
- convolve the filter with the image (dot products)
- the filter extends through the full depth of the input volume
- first stretch the filter into a vector (5*5*3 -> 1*75), then take dot products
- in practice, the filter is laid over a patch of the image, an element-wise product is summed, and the result becomes the value at the patch's center
- this is not the convolution from signal processing (the filter is not flipped)
- a set of multiple filters (N) produces N activation maps
- filters in deeper layers are longer (when stretched out) because their depth must match the depth of the input volume
e.g.:
32x32x3 -> 28x28x6 (6 feature maps, from six 5x5x3 filters with stride 1 and no padding)
CONV->ReLU->CONV->ReLU->POOLING->CONV->ReLU->CONV->ReLU->POOLING->CONV->ReLU->CONV->ReLU->POOLING->FULL CONNECT
A ConvNet is a sequence of convolution layers, interspersed with activation functions
- output size: (N - F)/stride + 1; common: zero pad the border (to preserve the spatial size)
- parameters: each filter always has 1 bias term, in addition to its F x F x depth weights
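A minimal numeric sketch of the output-size rule and the per-filter parameter count above; the 5x5x3 filters, stride 1, and no padding match the 32x32x3 -> 28x28x6 example.

    def conv_output_size(N, F, stride=1, pad=0):
        # (N - F + 2*pad) / stride + 1; with pad=0 this is the (N - F)/stride + 1 rule above
        return (N - F + 2 * pad) // stride + 1

    print(conv_output_size(32, 5))            # 28, matching 32x32x3 -> 28x28
    F, depth, num_filters = 5, 3, 6           # six 5x5x3 filters
    print(num_filters * (F * F * depth + 1))  # 456 parameters: 75 weights + 1 bias per filter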
Pooling Layer
- makes the representations smaller and more manageable
- invariance over a given region
- downsampling; does not operate on the depth dimension
- MAX POOLING
- commonly non-overlapping windows (e.g. 2x2 with stride 2)
- works better than average pooling in practice
- commonly no zero-padding
- stride in the conv layer can also be used for downsampling instead of pooling (a max-pooling sketch follows this list)
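A minimal max-pooling sketch (2x2 window, stride 2, no overlap, no padding), assuming the input height and width are even; it shows that pooling shrinks H and W while leaving the depth untouched.

    import numpy as np

    def max_pool_2x2(x):
        # x: activation volume of shape (H, W, D); pool over H and W only
        H, W, D = x.shape
        return x.reshape(H // 2, 2, W // 2, 2, D).max(axis=(1, 3))

    x = np.random.randn(4, 4, 3)
    print(max_pool_2x2(x).shape)  # (2, 2, 3): spatial size halves, depth unchanged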
Fully Connected Layer(FC layer)
typical arch:
[(CONV-RELU)*N - POOL?]*M - (FC-RELU)*K, SOFTMAX
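A small sketch that expands the pattern above into an explicit layer list; the default N, M, K are illustrative choices and roughly correspond to the example network above (N = 2, M = 3, K = 1).

    def conv_net_pattern(N=2, M=3, K=1, pool=True):
        # expand [(CONV-RELU)*N - POOL?]*M - (FC-RELU)*K - SOFTMAX into a layer list
        layers = []
        for _ in range(M):
            layers += ["CONV", "RELU"] * N
            if pool:                      # the "POOL?" part is optional
                layers.append("POOL")
        layers += ["FC", "RELU"] * K
        layers.append("SOFTMAX")
        return layers

    print(" -> ".join(conv_net_pattern()))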
Useful Notes
- CS231n neural network notes (three parts)
Preprocessing
- Mean subtraction:
X -= np.mean(X, axis=0)
- Normalization:
X /= np.std(X, axis=0)
- PCA: reduce dimensionality, saving space and time
- whitening: scale the decorrelated data by the eigenvalues so every dimension has comparable variance (see the sketch after this list)
- any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation / test data.
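A minimal PCA / whitening sketch in the same numpy style; X is assumed to be a [num_examples x num_features] training matrix, and the stand-in data and number of kept components are illustrative.

    import numpy as np

    X = np.random.randn(100, 50)            # stand-in training data
    X -= np.mean(X, axis=0)                 # zero-center first (statistics from training data only)
    cov = np.dot(X.T, X) / X.shape[0]       # covariance matrix
    U, S, V = np.linalg.svd(cov)
    Xrot = np.dot(X, U)                     # decorrelate the data
    Xrot_reduced = np.dot(X, U[:, :10])     # PCA: keep the top 10 components to save space and time
    Xwhite = Xrot / np.sqrt(S + 1e-5)       # whitening: scale every dimension by its eigenvalue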
Weight Initialization
- small random numbers:
W = 0.01 * np.random.randn(D, H)
- calibrating the variances with 1/sqrt(n) (see the sketch after this list)
- Batch Normalization:
- prevents vanishing gradients
- speeds up training
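A minimal sketch of the initializations above; D is the fan-in (number of inputs to a neuron), H the fan-out, and the sizes are illustrative.

    import numpy as np

    D, H = 512, 256
    W_small = 0.01 * np.random.randn(D, H)              # small random numbers
    W_calib = np.random.randn(D, H) / np.sqrt(D)        # calibrate the variance with 1/sqrt(n)
    W_relu  = np.random.randn(D, H) * np.sqrt(2.0 / D)  # sqrt(2/n) variant for ReLU units (see Summary)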
Regularization
- L2 regularization
- Max norm constraints
- Dropout
- practice:
- use a single, global L2 regularization strength
- with dropout (p = 0.5); see the inverted-dropout sketch after this list
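A minimal inverted-dropout sketch with p = 0.5: units are dropped and rescaled at training time so the test-time forward pass needs no change.

    import numpy as np

    def dropout_train(x, p=0.5):
        mask = (np.random.rand(*x.shape) < p) / p  # drop units and rescale in one step
        return x * mask

    def dropout_test(x):
        return x  # nothing to do: the scaling was folded into training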
Loss
classification
- hinge loss
- cross-entropy loss (see the sketch after this list)
- for a large number of classes: Hierarchical Softmax
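A single-example sketch of the two classification losses above, on made-up class scores s with correct class index y.

    import numpy as np

    s = np.array([3.2, 5.1, -1.7])   # raw class scores (illustrative)
    y = 0                            # index of the correct class

    # hinge loss (multiclass SVM): sum over j != y of max(0, s_j - s_y + 1)
    margins = np.maximum(0, s - s[y] + 1)
    margins[y] = 0
    hinge = np.sum(margins)          # 2.9

    # cross-entropy loss (softmax): -log of the normalized probability of the correct class
    p = np.exp(s - np.max(s)); p /= np.sum(p)
    xent = -np.log(p[y])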
Attribute classification
build a binary classifier for every single attribute independently
L_i = \sum_j \max(0, 1 - y_{ij} f_j)
- y_{ij} is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute
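A numeric sketch of this per-example attribute hinge loss, with made-up scores f_j and labels y_ij in {+1, -1}.

    import numpy as np

    f = np.array([0.7, -1.2, 0.1])   # scores for 3 attributes of one example (illustrative)
    y = np.array([1, -1, 1])         # y_ij: +1 / -1 attribute labels
    L_i = np.sum(np.maximum(0, 1 - y * f))
    print(L_i)                       # 0.3 + 0.0 + 0.9 = 1.2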
or train a logistic regression classifier for every attribute independently
P(y = 1 \mid x; w, b) = \frac{1}{1 + e^{-(w^T x + b)}} = \sigma(w^T x + b)
L_i = \sum_j y_{ij} \log(\sigma(f_j)) + (1 - y_{ij}) \log(1 - \sigma(f_j))
- the gradient is \partial L_i / \partial f_j = y_{ij} - \sigma(f_j)
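A sketch of the per-attribute logistic loss and its gradient as written above; the labels y_ij are assumed here to be 0/1, and since this formulation is a log-likelihood, its negative is what gets minimized in practice. Values are illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    f = np.array([0.7, -1.2, 0.1])   # scores f_j for one example (illustrative)
    y = np.array([1.0, 0.0, 1.0])    # y_ij, assumed 0 or 1 here
    L_i = np.sum(y * np.log(sigmoid(f)) + (1 - y) * np.log(1 - sigmoid(f)))
    grad = y - sigmoid(f)            # dL_i/df_j = y_ij - sigma(f_j)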
regression
L_i = \| f - y_i \|_2^2
- harder to optimize and less stable than a softmax loss
- when the output can be discretized into bins, a softmax (classification) loss is usually preferable
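A tiny sketch of the L2 regression loss for one example, with made-up predictions and targets.

    import numpy as np

    f  = np.array([0.5, 2.0])        # predictions (illustrative)
    yi = np.array([0.0, 1.5])        # targets
    L_i = np.sum((f - yi) ** 2)      # 0.25 + 0.25 = 0.5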
Summary
- center the data so each feature has zero mean, and normalize its scale to [-1, 1]
- initialize W from a Gaussian distribution with standard deviation \sqrt{2/n}, where n is the number of inputs to the neuron
- L2 regularization and dropout
- batch normalization
Later
notes3
Todos
- reading batch normalization
- reading notes3