Notes on "Siamese Neural Networks for One-shot Image Recognition"

1 Motivation

  • Machine learning models often break down when they are forced to make predictions about data for which little supervised information is available.

  • One-shot learning: we may only observe a single example of each possible class before making a prediction about a test instance.

2 Innovation

This paper uses siamese neural networks to deal with the problem of one-shot learning.

3 Advantages

  • Once a siamese neural network has been tuned, we can then capitalize on powerful discriminative features to generalize the predictive power of the network not just to new data, but to entirely new classes from unknown distributions.
  • Using a convolutional architecture, we are able to achieve strong results which exceed those of other deep learning models with near state-of-the-art performance on one-shot classification tasks.

4 Related work

  • Li Fei-Fei et al. developed a variational Bayesian framework for one-shot image classification.
  • Lake et al. addressed one-shot learning for character recognition with a method called Hierarchical Bayesian Program Learning (HBPL).

5 Model

Siamese Neural Network with L fully-connected layers


This paper experiments with 2-layer, 3-layer, and 4-layer networks.

$h_{1,l}$: the hidden vector in layer $l$ for the first twin.
$h_{2,l}$: the hidden vector in layer $l$ for the second twin.
For the first $L-1$ layers:

$$h_{1,l} = \max\left(0,\; W_{l-1,l}^{T}\, h_{1,(l-1)} + b_l\right), \qquad h_{2,l} = \max\left(0,\; W_{l-1,l}^{T}\, h_{2,(l-1)} + b_l\right)$$

For the last layer:

$$p = \sigma\left(\sum_j \alpha_j \left|h_{1,l}^{(j)} - h_{2,l}^{(j)}\right|\right)$$

where $\sigma$ is the sigmoid activation function and the $\alpha_j$ are parameters learned during training that weight the importance of each component of the distance.
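
As a concrete illustration of these equations, here is a minimal NumPy sketch of the forward pass; the function and argument names (`siamese_fc_forward`, `weights`, `biases`, `alpha`) are illustrative, not from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def siamese_fc_forward(x1, x2, weights, biases, alpha):
    """Forward pass of the fully-connected siamese network.

    Both twins share the same weights and biases; alpha holds the
    learned coefficients on the component-wise L1 distance.
    """
    h1, h2 = x1, x2
    for W, b in zip(weights, biases):        # the first L-1 shared layers
        h1 = relu(W.T @ h1 + b)
        h2 = relu(W.T @ h2 + b)
    # last layer: sigmoid of the weighted L1 distance between the twins
    return sigmoid(alpha @ np.abs(h1 - h2))
```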

Siamese Neural Network with CNN


first twin: (conv → ReLU → max-pool) × 3 → conv → FC → sigmoid
second twin: (conv → ReLU → max-pool) × 3 → conv → FC → sigmoid (identical to the first twin, with shared weights)

$$h_{1,l}^{(k)} = \text{max-pool}\left(\max\left(0,\; W_{l-1,l}^{(k)} \star h_{1,(l-1)} + b_l\right),\, 2\right)$$
$$h_{2,l}^{(k)} = \text{max-pool}\left(\max\left(0,\; W_{l-1,l}^{(k)} \star h_{2,(l-1)} + b_l\right),\, 2\right)$$

where $k$ indexes the filter maps and $\star$ denotes the convolution operation.

For the last fully-connected layer:

$$p = \sigma\left(\sum_j \alpha_j \left|h_{1,l}^{(j)} - h_{2,l}^{(j)}\right|\right)$$
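
A sketch of this convolutional architecture in Keras, in the spirit of the linked keras-oneshot code; the filter counts and sizes follow the paper's description, but treat the details as illustrative rather than an exact reproduction:

```python
import tensorflow as tf
from tensorflow.keras import layers, Input, Model

def build_conv_siamese(input_shape=(105, 105, 1)):
    """Convolutional siamese network: one shared twin applied to both inputs."""
    twin = tf.keras.Sequential([
        layers.Conv2D(64, (10, 10), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (7, 7), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (4, 4), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(256, (4, 4), activation="relu"),
        layers.Flatten(),
        layers.Dense(4096, activation="sigmoid"),
    ])
    x1, x2 = Input(input_shape), Input(input_shape)
    f1, f2 = twin(x1), twin(x2)               # weight sharing: same twin twice
    # component-wise L1 distance, then a learned weighted sum + sigmoid:
    # p = sigma(sum_j alpha_j |f1_j - f2_j|)
    l1 = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([f1, f2])
    p = layers.Dense(1, activation="sigmoid")(l1)
    return Model(inputs=[x1, x2], outputs=p)
```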

6 Learning

Loss function

$M$: the minibatch size
$y(x_1^{(i)}, x_2^{(i)})$: the label for the $i$-th pair in the minibatch; $y(x_1^{(i)}, x_2^{(i)}) = 1$ if $x_1$ and $x_2$ are from the same class, and $y(x_1^{(i)}, x_2^{(i)}) = 0$ otherwise
loss function: regularized cross-entropy

$$\mathcal{L}(x_1^{(i)}, x_2^{(i)}) = y(x_1^{(i)}, x_2^{(i)}) \log p(x_1^{(i)}, x_2^{(i)}) + \left(1 - y(x_1^{(i)}, x_2^{(i)})\right) \log\left(1 - p(x_1^{(i)}, x_2^{(i)})\right) + \boldsymbol{\lambda}^{T} |\mathbf{w}|^{2}$$
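
In code this is the usual binary cross-entropy plus a per-layer L2 penalty. A minimal NumPy sketch (names are illustrative; note the paper writes the cross-entropy terms with a positive sign, while the negated form below is what is actually minimized):

```python
import numpy as np

def pair_loss(p, y, layer_weights, lam):
    """Regularized cross-entropy over one minibatch of pairs.

    p, y : arrays of shape [M] with predictions and same-class labels.
    layer_weights, lam : per-layer weight arrays and L2 coefficients.
    """
    ce = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean()
    l2 = sum(l * np.sum(w ** 2) for w, l in zip(layer_weights, lam))
    return ce + l2
```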

Optimization

$\eta_j$: learning rate for layer $j$
$\mu_j$: momentum for layer $j$
$\lambda_j$: $L_2$ regularization weight for layer $j$

The update rule at epoch $T$ is as follows:

$$w_{kj}^{(T)}(x_1^{(i)}, x_2^{(i)}) = w_{kj}^{(T)} + \Delta w_{kj}^{(T)}(x_1^{(i)}, x_2^{(i)}) + 2\lambda_j |w_{kj}|$$
$$\Delta w_{kj}^{(T)}(x_1^{(i)}, x_2^{(i)}) = -\eta_j \nabla w_{kj}^{(T)} + \mu_j \Delta w_{kj}^{(T-1)}$$

where $\nabla w_{kj}^{(T)}$ is the partial derivative of the loss with respect to the weight between the $j$-th neuron in one layer and the $k$-th neuron in the successive layer.
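
A sketch of one update step, assuming the regularization term acts as the gradient of the L2 penalty (i.e. subtracting $2\lambda_j w$, despite the paper's $+2\lambda_j|w_{kj}|$ notation); the function name and signature are illustrative:

```python
def sgd_momentum_step(w, grad, velocity, eta, mu, lam):
    """One per-layer SGD-with-momentum update.

    velocity caches the previous update Delta w^(T-1); eta, mu, lam are
    this layer's learning rate, momentum, and L2 regularization weight.
    """
    velocity = -eta * grad + mu * velocity   # Delta w^(T)
    w = w + velocity - 2.0 * lam * w         # descend on loss + L2 penalty
    return w, velocity
```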

Weight initialization

Siamese Neural Network with L fully-connected layers

W of fully-connected layers: normal distribution, zero mean, standard deviation $1/\text{fan-in}$ (fan-in $= n_{l-1}$)
b of fully-connected layers: normal distribution, mean 0.5, standard deviation 0.01

Siamese Neural Network with CNN

W of fully-connected layers: normal distribution, zero-mean, standard deviation 0.2
b of fully-connected layers: normal distribution, mean 0.5, standard deviation 0.01
w of convolution layers: normal distribution, zero-mean, standard deviation 0.01
b of convolution layers: normal distribution, mean 0.5, standard deviation 0.01
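
The recipe above translates directly into initializer functions. A NumPy sketch (names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_fc(n_in, n_out, conv_net=False):
    """Initialize one fully-connected layer.

    The conv net's FC layers use std 0.2; the fully-connected-only
    net uses std 1/fan-in. Biases are N(0.5, 0.01) in both cases.
    """
    std = 0.2 if conv_net else 1.0 / n_in
    W = rng.normal(0.0, std, size=(n_in, n_out))
    b = rng.normal(0.5, 0.01, size=n_out)
    return W, b

def init_conv(k, c_in, c_out):
    """Initialize one conv layer: w ~ N(0, 0.01), b ~ N(0.5, 0.01)."""
    w = rng.normal(0.0, 0.01, size=(k, k, c_in, c_out))
    b = rng.normal(0.5, 0.01, size=c_out)
    return w, b
```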

Learning schedule

Learning rates decay by 1% per epoch: $\eta_j^{(T)} = 0.99\, \eta_j^{(T-1)}$.
Momentum starts at 0.5 in every layer and increases linearly each epoch until reaching the value $\mu_j$.

This paper trained the siamese neural network with fully-connected layers for 300 epochs, and the siamese neural network with CNN for 200 epochs.

This paper monitored one-shot validation error on a set of 320 one-shot learning tasks. When the validation error did not decrease for 20 epochs, training was stopped and the parameters of the model at the best epoch, according to one-shot validation error, were used.
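
Putting the schedule and early stopping together, a training-loop skeleton; `run_epoch` and `validation_error` are hypothetical caller-supplied functions, and the weight save/restore uses Keras-style `get_weights`/`set_weights`:

```python
def train(model, epochs, etas, run_epoch, validation_error, patience=20):
    """Training loop with 1% learning-rate decay and one-shot early stopping."""
    best_err, best_weights, since_best = float("inf"), None, 0
    for T in range(epochs):
        etas = [0.99 * eta for eta in etas]  # per-layer 1% decay per epoch
        run_epoch(model, etas)               # momentum ramps toward mu_j inside
        err = validation_error(model)        # error on the 320 one-shot tasks
        if err < best_err:
            best_err, best_weights, since_best = err, model.get_weights(), 0
        else:
            since_best += 1
            if since_best >= patience:       # no improvement for 20 epochs
                break
    model.set_weights(best_weights)          # restore the best epoch
    return best_err
```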

Omniglot dataset

The Omniglot data set contains examples from 50 alphabets, with anywhere from about 15 to upwards of 40 characters in each alphabet. All characters across these alphabets are produced a single time by each of 20 drawers.

Examples from the Omniglot data set.

Affine distortions

This paper augmented the training set with small affine distortions. For each image pair $(x_1, x_2)$, a pair of affine transformations $(T_1, T_2)$ is generated to yield $x_1' = T_1(x_1)$ and $x_2' = T_2(x_2)$, where each $T = (\theta, \rho_x, \rho_y, s_x, s_y, t_x, t_y)$ is composed of a rotation $\theta$, shears $\rho_x, \rho_y$, scales $s_x, s_y$, and translations $t_x, t_y$.
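
A sketch of sampling such a transformation; the ranges below follow the values reported in the paper (rotation in degrees, shear, per-axis scale, translation in pixels), with each component included with probability 0.5:

```python
import numpy as np

def sample_affine(rng=np.random.default_rng()):
    """Sample T = (theta, rho_x, rho_y, s_x, s_y, t_x, t_y).

    Each component is applied with probability 0.5; otherwise it
    defaults to the identity (0 rotation/shear/shift, scale 1).
    """
    theta = rng.uniform(-10.0, 10.0) if rng.random() < 0.5 else 0.0
    rho_x = rng.uniform(-0.3, 0.3) if rng.random() < 0.5 else 0.0
    rho_y = rng.uniform(-0.3, 0.3) if rng.random() < 0.5 else 0.0
    s_x = rng.uniform(0.8, 1.2) if rng.random() < 0.5 else 1.0
    s_y = rng.uniform(0.8, 1.2) if rng.random() < 0.5 else 1.0
    t_x = rng.uniform(-2.0, 2.0) if rng.random() < 0.5 else 0.0
    t_y = rng.uniform(-2.0, 2.0) if rng.random() < 0.5 else 0.0
    return theta, rho_x, rho_y, s_x, s_y, t_x, t_y
```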

A sample of random affine distortions generated for a single character in the Omniglot data set.

Train

The mini-batch size is 32; each mini-batch contains both same-class and different-class pairs.
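
A possible way to build such a mini-batch (a sketch; `class_to_images`, mapping a class id to an array of that class's images, is an assumed input):

```python
import numpy as np

def make_batch(class_to_images, batch_size=32, rng=np.random.default_rng()):
    """Build one training mini-batch: half same-class (y=1), half not (y=0)."""
    classes = list(class_to_images)
    x1, x2, y = [], [], []
    for i in range(batch_size):
        c = rng.choice(classes)
        imgs = class_to_images[c]
        a = imgs[rng.integers(len(imgs))]
        if i % 2 == 0:                                 # same-class pair
            b = imgs[rng.integers(len(imgs))]
            y.append(1)
        else:                                          # different-class pair
            c2 = rng.choice([k for k in classes if k != c])
            other = class_to_images[c2]
            b = other[rng.integers(len(other))]
            y.append(0)
        x1.append(a)
        x2.append(b)
    return np.stack(x1), np.stack(x2), np.array(y)
```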

7 Experiment

Test

Testing is done on N-way one-shot classification tasks; the paper illustrates a 20-way task.
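
For evaluation, the test image is paired with each of the N support images and the pair with the highest similarity score wins. A sketch, assuming a Keras-style `model` (as built above) and a hypothetical `tasks` list of `(test_image, support_set, true_index)` triples:

```python
import numpy as np

def n_way_accuracy(model, tasks):
    """Fraction of N-way one-shot tasks answered correctly."""
    correct = 0
    for test_img, support, true_idx in tasks:
        # score the test image against each of the N support images
        scores = [model.predict([test_img[None], s[None]], verbose=0)[0, 0]
                  for s in support]
        if int(np.argmax(scores)) == true_idx:
            correct += 1
    return correct / len(tasks)
```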

Results

Siamese Neural Network with L fully-connected layers

Accuracy on the Omniglot verification task.

Siamese Neural Network with CNN

Accuracy on the Omniglot verification task.

One-shot Image Recognition

Example of the model's top-5 classification performance on the 1-versus-20 one-shot classification task.

One-shot accuracy on the evaluation set.

Comparison of the best one-shot accuracy from each type of network against the baselines.

Reference: https://blog.csdn.net/bryant_meng/article/details/80087079
Code: https://github.com/sorenbouma/keras-oneshot
