This week's lectures were quite interesting: they show how the magical "van Gogh-ify a photo" effect is actually made. For creative applications, the key step is formalizing the problem; once the problem is clearly defined, you are more than halfway to a solution.
1. Face Recognition
1.1 定义
(Face) Verification
- Input image, name/ID
- Output whether the input image is that of claimed person
(Face) Recognition
- Has a database of K persons
- Get an input image
- Output ID if the image is any of the K persons (or “not recognized”)
Key difficulty: One-Shot Learning.
Learning from one example to recognize the person again, i.e. all subsequent recognition must work from a single provided photo.
1.2 Siamese Network
Learning a similarity function
if $d(\text{image1}, \text{image2}) \le \tau$, "same"
else $d(\text{image1}, \text{image2}) \gt \tau$, "different"
1.2.1 Encoding of image
d(x^{(1)}, x^{(2)}) = \left \| f(x^{(1)}) - f(x^{(2)}) \right \|^2_2
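The distance-plus-threshold rule above can be sketched in a few lines of NumPy. The function names (`distance`, `same_person`) and the value of $\tau$ are illustrative, not from the course; real embeddings would come from the trained network's 128-d output.

```python
import numpy as np

def distance(f1, f2):
    """Squared L2 distance between two embeddings f(x1), f(x2)."""
    return float(np.sum((f1 - f2) ** 2))

def same_person(f1, f2, tau=0.7):
    """Verification rule: 'same' if d(x1, x2) <= tau, else 'different'."""
    return distance(f1, f2) <= tau

# toy 128-d embeddings (a trained Siamese network would produce these)
a = np.zeros(128)
b = np.zeros(128)
b[0] = 0.5
print(distance(a, b))     # 0.25
print(same_person(a, b))  # True with tau=0.7
```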
1.2.2 Loss
- Triplet Loss
APN - Anchor/Positive/Negative
\left \| f(A) - f(P) \right \|^2 + \alpha \le \left \| f(A) - f(N) \right \|^2
$\alpha$ is the margin; it forces a clearer gap between the positive-pair and negative-pair distances.
\mathcal{L}(A, P, N) = \max \left( \left \| f(A) - f(P) \right \|^2 - \left \| f(A) - f(N) \right \|^2 + \alpha,\ 0 \right)
J = \sum^m_{i=1} \mathcal{L}(A^{(i)}, P^{(i)}, N^{(i)})
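The triplet loss and its sum over $m$ triplets can be sketched as follows. This is a minimal NumPy illustration of the formulas, assuming precomputed embeddings; function names are my own, and in practice the loss would be differentiated through the embedding network.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """L(A, P, N) = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    d_pos = np.sum((f_a - f_p) ** 2)  # anchor-positive distance
    d_neg = np.sum((f_a - f_n) ** 2)  # anchor-negative distance
    return max(d_pos - d_neg + alpha, 0.0)

def total_cost(triplets, alpha=0.2):
    """J = sum over the m triplets (A^(i), P^(i), N^(i))."""
    return sum(triplet_loss(a, p, n, alpha) for a, p, n in triplets)

anchor   = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])  # close to anchor
negative = np.array([1.0, 0.0])  # far from anchor
# easy triplet: margin already satisfied, so the loss is zero
print(triplet_loss(anchor, positive, negative))  # 0.0
```

An "easy" triplet like the one above contributes zero gradient, which is exactly why the notes say to pick triplets that are "hard" to train on.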
Choose triplets that are “hard” to train on.
- Binary Classification
Alternatively, turn the problem into a binary classification task:
\hat{y} = \sigma \left( \sum_{k=1}^{128} w_k \left| f(x^{(i)})_k - f(x^{(j)})_k \right | + b \right)
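A minimal sketch of this classification head, assuming two precomputed 128-d embeddings; the names `predict_same`, `w`, and `b` are illustrative, and in practice $w$ and $b$ would be learned by logistic regression on pairs labeled same/different.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_same(f_i, f_j, w, b):
    """y_hat = sigma(sum_k w_k * |f(x_i)_k - f(x_j)_k| + b)."""
    return sigmoid(np.dot(w, np.abs(f_i - f_j)) + b)

# identical embeddings and w = 0 give sigma(b); with b = 0 that is 0.5
f = np.ones(128)
print(predict_same(f, f, np.zeros(128), 0.0))  # 0.5
```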
2. Neural Style Transfer
Definitions: Content / Style / Generated image; generate G by rendering C in the style of S.
Cost Function
J(G) = \alpha J_\text{content}(C, G) + \beta J_\text{style}(S, G)
The cost function has two parts.
The first scores content similarity, computed from a hidden layer's activations:
J_\text{content} = \frac{1}{2} \left \| a^{[l](C)} - a^{[l](G)} \right \|^2
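The content cost is just half the squared difference between the two activation volumes. A minimal NumPy sketch with a toy activation shape (the function name and shapes are illustrative; in the course this runs on a VGG layer's activations):

```python
import numpy as np

def content_cost(a_C, a_G):
    """J_content = 1/2 * ||a^[l](C) - a^[l](G)||^2 over one hidden layer."""
    return 0.5 * float(np.sum((a_C - a_G) ** 2))

# toy n_H x n_W x n_C activations
a_C = np.zeros((4, 4, 3))
a_G = np.zeros((4, 4, 3))
a_G[0, 0, 0] = 2.0
print(content_cost(a_C, a_G))  # 0.5 * 2.0**2 = 2.0
```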
The second scores style similarity.
Definition of style: correlation between activations across channels.
G_{kk'}^{[l]} = \sum_{i=1}^{n_H^{[l]}} \sum_{j=1}^{n_W^{[l]}} a_{ijk}^{[l]} a_{ijk'}^{[l]}
$G^{[l]}$ is an $n_c^{[l]} \times n_c^{[l]}$ Gram matrix; $k$ and $k'$ both index channels, and $l$ means the activations are taken at layer $l$. Compute it separately on S and G:
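The double sum over spatial positions collapses neatly into a matrix product once the activation volume is flattened to (positions × channels). A NumPy sketch with toy shapes (`gram_matrix` is my own name for it):

```python
import numpy as np

def gram_matrix(a):
    """a: n_H x n_W x n_C activations. Returns the n_C x n_C Gram matrix
    G[k, k'] = sum over positions (i, j) of a[i, j, k] * a[i, j, k']."""
    n_H, n_W, n_C = a.shape
    a_flat = a.reshape(n_H * n_W, n_C)  # rows: positions, cols: channels
    return a_flat.T @ a_flat

a = np.zeros((2, 2, 3))
a[:, :, 0] = 1.0        # channel 0 is constant 1 everywhere
G = gram_matrix(a)
print(G.shape)          # (3, 3)
print(G[0, 0])          # 4.0: sum of 1*1 over the 2x2 positions
```

Diagonal entries measure how active each channel is; off-diagonal entries measure how often two channels fire together, which is what the course means by "style".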
J^{[l]}_\text{style}(S, G) = \frac {1} {\left(2 n_H^{[l]} n_W^{[l]} n_C^{[l]}\right)^2} \left \| G^{[l](S)} - G^{[l](G)} \right \|_F^2
Finally, compute the style cost on several hidden layers and combine the results.
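The per-layer style cost and the multi-layer combination can be sketched as below. This assumes per-layer weights $\lambda^{[l]}$ for the combination (a common choice; the notes only say to compute over several layers), and all names are illustrative:

```python
import numpy as np

def gram_matrix(a):
    """n_H x n_W x n_C activations -> n_C x n_C Gram matrix."""
    n_H, n_W, n_C = a.shape
    a_flat = a.reshape(n_H * n_W, n_C)
    return a_flat.T @ a_flat

def layer_style_cost(a_S, a_G):
    """J_style^[l] = ||G^[l](S) - G^[l](G)||_F^2 / (2 n_H n_W n_C)^2."""
    n_H, n_W, n_C = a_S.shape
    diff = gram_matrix(a_S) - gram_matrix(a_G)
    return float(np.sum(diff ** 2)) / (2 * n_H * n_W * n_C) ** 2

def style_cost(acts_S, acts_G, lambdas):
    """Weighted sum over layers: J_style = sum_l lambda^[l] * J_style^[l]."""
    return sum(lam * layer_style_cost(a_S, a_G)
               for lam, a_S, a_G in zip(lambdas, acts_S, acts_G))

# toy check: identical activations give zero style cost
a_S = np.zeros((2, 2, 3))
a_G = np.zeros((2, 2, 3))
a_G[:, :, 0] = 1.0
print(layer_style_cost(a_G, a_G))  # 0.0
print(layer_style_cost(a_S, a_G))  # ||diff||_F^2 = 16 over (2*2*2*3)^2 = 576
```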
3. What are deep ConvNets learning?
Shallow layers learn basic features: each filter in the first layer extracts a different edge feature.
Deeper layers learn increasingly composite patterns, e.g. local parts such as ears and noses.