Notes on "Siamese Neural Networks for One-shot Image Recognition"

1 Motivation

  • Machine learning models often break down when they are forced to make predictions about data for which little supervised information is available.

  • One-shot learning: we may only observe a single example of each possible class before making a prediction about a test instance.

2 Innovation

This paper uses siamese neural networks to deal with the problem of one-shot learning.

3 Advantages

  • Once a siamese neural network has been tuned, we can then capitalize on powerful discriminative features to generalize the predictive power of the network not just to new data, but to entirely new classes from unknown distributions.
  • Using a convolutional architecture, we are able to achieve strong results which exceed those of other deep learning models with near state-of-the-art performance on one-shot classification tasks.

4 Related work

  • Li Fei-Fei et al. developed a variational Bayesian framework for one-shot image classification.
  • Lake et al. addressed one-shot learning for character recognition with a method called Hierarchical Bayesian Program Learning (HBPL).

5 Model

Siamese Neural Network with L fully-connected layers


This paper experiments with 2-layer, 3-layer, and 4-layer networks.

$h_{1,l}$: the hidden vector in layer $l$ for the first twin.
$h_{2,l}$: the hidden vector in layer $l$ for the second twin.
For the first $L-1$ layers:

$$h_{1,l} = \max\left(0,\; W_{l-1,l}^{T}\, h_{1,(l-1)} + b_l\right), \qquad h_{2,l} = \max\left(0,\; W_{l-1,l}^{T}\, h_{2,(l-1)} + b_l\right)$$

For the last layer:

$$p = \sigma\left(\sum_j \alpha_j \left|h_{1,l}^{(j)} - h_{2,l}^{(j)}\right|\right)$$

where $\sigma$ is the sigmoid activation function and the $\alpha_j$ are parameters learned during training that weight the importance of each component of the distance.
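
As a concrete illustration of these equations, here is a minimal NumPy sketch of the forward pass; the function and argument names (`siamese_fc_forward`, `weights`, `biases`, `alpha`) are illustrative, not from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def siamese_fc_forward(x1, x2, weights, biases, alpha):
    """Forward pass of the fully-connected siamese network.

    Both twins share the same weights and biases; alpha holds the
    learned coefficients on the component-wise L1 distance.
    """
    h1, h2 = x1, x2
    for W, b in zip(weights, biases):        # the first L-1 shared layers
        h1 = relu(W.T @ h1 + b)
        h2 = relu(W.T @ h2 + b)
    # last layer: sigmoid of the weighted L1 distance between the twins
    return sigmoid(alpha @ np.abs(h1 - h2))
```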

Siamese Neural Network with CNN


first twin: (conv → ReLU → max-pool) × 3 → conv → FC → sigmoid
second twin: (conv → ReLU → max-pool) × 3 → conv → FC → sigmoid (identical to the first twin, with shared weights)

$$h_{1,l}^{(k)} = \text{max-pool}\left(\max\left(0,\; W_{l-1,l}^{(k)} \star h_{1,(l-1)} + b_l\right),\, 2\right)$$
$$h_{2,l}^{(k)} = \text{max-pool}\left(\max\left(0,\; W_{l-1,l}^{(k)} \star h_{2,(l-1)} + b_l\right),\, 2\right)$$

where $k$ indexes the filter maps and $\star$ denotes the convolution operation.

For the last fully-connected layer:

$$p = \sigma\left(\sum_j \alpha_j \left|h_{1,l}^{(j)} - h_{2,l}^{(j)}\right|\right)$$
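
A sketch of this convolutional architecture in Keras, in the spirit of the linked keras-oneshot code; the filter counts and sizes follow the paper's description, but treat the details as illustrative rather than an exact reproduction:

```python
import tensorflow as tf
from tensorflow.keras import layers, Input, Model

def build_conv_siamese(input_shape=(105, 105, 1)):
    """Convolutional siamese network: one shared twin applied to both inputs."""
    twin = tf.keras.Sequential([
        layers.Conv2D(64, (10, 10), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (7, 7), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (4, 4), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(256, (4, 4), activation="relu"),
        layers.Flatten(),
        layers.Dense(4096, activation="sigmoid"),
    ])
    x1, x2 = Input(input_shape), Input(input_shape)
    f1, f2 = twin(x1), twin(x2)               # weight sharing: same twin twice
    # component-wise L1 distance, then a learned weighted sum + sigmoid:
    # p = sigma(sum_j alpha_j |f1_j - f2_j|)
    l1 = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([f1, f2])
    p = layers.Dense(1, activation="sigmoid")(l1)
    return Model(inputs=[x1, x2], outputs=p)
```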

6 Learning

Loss function

$M$: the minibatch size
$y(x_1^{(i)}, x_2^{(i)})$: the label for the $i$-th pair in the minibatch; $y(x_1^{(i)}, x_2^{(i)}) = 1$ if $x_1$ and $x_2$ are from the same class, and $y(x_1^{(i)}, x_2^{(i)}) = 0$ otherwise
loss function: regularized cross-entropy

$$\mathcal{L}(x_1^{(i)}, x_2^{(i)}) = y(x_1^{(i)}, x_2^{(i)}) \log p(x_1^{(i)}, x_2^{(i)}) + \left(1 - y(x_1^{(i)}, x_2^{(i)})\right) \log\left(1 - p(x_1^{(i)}, x_2^{(i)})\right) + \boldsymbol{\lambda}^{T} |\mathbf{w}|^{2}$$
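
In code this is the usual binary cross-entropy plus a per-layer L2 penalty. A minimal NumPy sketch (names are illustrative; note the paper writes the cross-entropy terms with a positive sign, while the negated form below is what is actually minimized):

```python
import numpy as np

def pair_loss(p, y, layer_weights, lam):
    """Regularized cross-entropy over one minibatch of pairs.

    p, y : arrays of shape [M] with predictions and same-class labels.
    layer_weights, lam : per-layer weight arrays and L2 coefficients.
    """
    ce = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean()
    l2 = sum(l * np.sum(w ** 2) for w, l in zip(layer_weights, lam))
    return ce + l2
```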

Optimization

$\eta_j$: learning rate for layer $j$
$\mu_j$: momentum for layer $j$
$\lambda_j$: $L_2$ regularization weight for layer $j$

The update rule at epoch $T$ is as follows:

$$w_{kj}^{(T)}(x_1^{(i)}, x_2^{(i)}) = w_{kj}^{(T)} + \Delta w_{kj}^{(T)}(x_1^{(i)}, x_2^{(i)}) + 2\lambda_j |w_{kj}|$$
$$\Delta w_{kj}^{(T)}(x_1^{(i)}, x_2^{(i)}) = -\eta_j \nabla w_{kj}^{(T)} + \mu_j \Delta w_{kj}^{(T-1)}$$

where $\nabla w_{kj}^{(T)}$ is the partial derivative of the loss with respect to the weight between the $j$-th neuron in one layer and the $k$-th neuron in the successive layer.
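
A sketch of one update step, assuming the regularization term acts as the gradient of the L2 penalty (i.e. subtracting $2\lambda_j w$, despite the paper's $+2\lambda_j|w_{kj}|$ notation); the function name and signature are illustrative:

```python
def sgd_momentum_step(w, grad, velocity, eta, mu, lam):
    """One per-layer SGD-with-momentum update.

    velocity caches the previous update Delta w^(T-1); eta, mu, lam are
    this layer's learning rate, momentum, and L2 regularization weight.
    """
    velocity = -eta * grad + mu * velocity   # Delta w^(T)
    w = w + velocity - 2.0 * lam * w         # descend on loss + L2 penalty
    return w, velocity
```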

Weight initialization

Siamese Neural Network with L fully-connected layers

W of fully-connected layers: normal distribution, zero mean, standard deviation $1/\text{fan-in}$ (fan-in $= n_{l-1}$)
b of fully-connected layers: normal distribution, mean 0.5, standard deviation 0.01

Siamese Neural Network with CNN

W of fully-connected layers: normal distribution, zero-mean, standard deviation 0.2
b of fully-connected layers: normal distribution, mean 0.5, standard deviation 0.01
w of convolution layers: normal distribution, zero-mean, standard deviation 0.01
b of convolution layers: normal distribution, mean 0.5, standard deviation 0.01
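
The recipe above translates directly into initializer functions. A NumPy sketch (names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_fc(n_in, n_out, conv_net=False):
    """Initialize one fully-connected layer.

    The conv net's FC layers use std 0.2; the fully-connected-only
    net uses std 1/fan-in. Biases are N(0.5, 0.01) in both cases.
    """
    std = 0.2 if conv_net else 1.0 / n_in
    W = rng.normal(0.0, std, size=(n_in, n_out))
    b = rng.normal(0.5, 0.01, size=n_out)
    return W, b

def init_conv(k, c_in, c_out):
    """Initialize one conv layer: w ~ N(0, 0.01), b ~ N(0.5, 0.01)."""
    w = rng.normal(0.0, 0.01, size=(k, k, c_in, c_out))
    b = rng.normal(0.5, 0.01, size=c_out)
    return w, b
```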

Learning schedule

Learning rates decay by 1% per epoch: $\eta_j^{(T)} = 0.99\, \eta_j^{(T-1)}$.
Momentum starts at 0.5 in every layer and increases linearly each epoch until reaching the value $\mu_j$.

This paper trained the siamese neural network with fully-connected layers for 300 epochs, and the siamese neural network with CNN for 200 epochs.

This paper monitored one-shot validation error on a set of 320 one-shot learning tasks. When the validation error did not decrease for 20 epochs, training was stopped and the parameters of the model at the best epoch, according to one-shot validation error, were used.
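
Putting the schedule and early stopping together, a training-loop skeleton; `run_epoch` and `validation_error` are hypothetical caller-supplied functions, and the weight save/restore uses Keras-style `get_weights`/`set_weights`:

```python
def train(model, epochs, etas, run_epoch, validation_error, patience=20):
    """Training loop with 1% learning-rate decay and one-shot early stopping."""
    best_err, best_weights, since_best = float("inf"), None, 0
    for T in range(epochs):
        etas = [0.99 * eta for eta in etas]  # per-layer 1% decay per epoch
        run_epoch(model, etas)               # momentum ramps toward mu_j inside
        err = validation_error(model)        # error on the 320 one-shot tasks
        if err < best_err:
            best_err, best_weights, since_best = err, model.get_weights(), 0
        else:
            since_best += 1
            if since_best >= patience:       # no improvement for 20 epochs
                break
    model.set_weights(best_weights)          # restore the best epoch
    return best_err
```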

Omniglot dataset

The Omniglot data set contains examples from 50 alphabets, with anywhere from about 15 to upwards of 40 characters in each alphabet. All characters across these alphabets are produced a single time by each of 20 drawers.

Examples from the Omniglot data set.

Affine distortions

This paper augmented the training set with small affine distortions. For each image pair $(x_1, x_2)$, a pair of affine transformations $(T_1, T_2)$ is generated to yield $x_1' = T_1(x_1)$ and $x_2' = T_2(x_2)$, where each $T = (\theta, \rho_x, \rho_y, s_x, s_y, t_x, t_y)$ is composed of a rotation $\theta$, shears $\rho_x, \rho_y$, scales $s_x, s_y$, and translations $t_x, t_y$.
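
A sketch of sampling such a transformation; the ranges below follow the values reported in the paper (rotation in degrees, shear, per-axis scale, translation in pixels), with each component included with probability 0.5:

```python
import numpy as np

def sample_affine(rng=np.random.default_rng()):
    """Sample T = (theta, rho_x, rho_y, s_x, s_y, t_x, t_y).

    Each component is applied with probability 0.5; otherwise it
    defaults to the identity (0 rotation/shear/shift, scale 1).
    """
    theta = rng.uniform(-10.0, 10.0) if rng.random() < 0.5 else 0.0
    rho_x = rng.uniform(-0.3, 0.3) if rng.random() < 0.5 else 0.0
    rho_y = rng.uniform(-0.3, 0.3) if rng.random() < 0.5 else 0.0
    s_x = rng.uniform(0.8, 1.2) if rng.random() < 0.5 else 1.0
    s_y = rng.uniform(0.8, 1.2) if rng.random() < 0.5 else 1.0
    t_x = rng.uniform(-2.0, 2.0) if rng.random() < 0.5 else 0.0
    t_y = rng.uniform(-2.0, 2.0) if rng.random() < 0.5 else 0.0
    return theta, rho_x, rho_y, s_x, s_y, t_x, t_y
```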

A sample of random affine distortions generated for a single character in the Omniglot data set.

Train

The mini-batch size is 32; each mini-batch contains both same-class and different-class pairs.
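
A possible way to build such a mini-batch (a sketch; `class_to_images`, mapping a class id to an array of that class's images, is an assumed input):

```python
import numpy as np

def make_batch(class_to_images, batch_size=32, rng=np.random.default_rng()):
    """Build one training mini-batch: half same-class (y=1), half not (y=0)."""
    classes = list(class_to_images)
    x1, x2, y = [], [], []
    for i in range(batch_size):
        c = rng.choice(classes)
        imgs = class_to_images[c]
        a = imgs[rng.integers(len(imgs))]
        if i % 2 == 0:                                 # same-class pair
            b = imgs[rng.integers(len(imgs))]
            y.append(1)
        else:                                          # different-class pair
            c2 = rng.choice([k for k in classes if k != c])
            other = class_to_images[c2]
            b = other[rng.integers(len(other))]
            y.append(0)
        x1.append(a)
        x2.append(b)
    return np.stack(x1), np.stack(x2), np.array(y)
```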

7 Experiment

Test

Testing is done on N-way one-shot classification tasks; the paper illustrates a 20-way task.
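
For evaluation, the test image is paired with each of the N support images and the pair with the highest similarity score wins. A sketch, assuming a Keras-style `model` (as built above) and a hypothetical `tasks` list of `(test_image, support_set, true_index)` triples:

```python
import numpy as np

def n_way_accuracy(model, tasks):
    """Fraction of N-way one-shot tasks answered correctly."""
    correct = 0
    for test_img, support, true_idx in tasks:
        # score the test image against each of the N support images
        scores = [model.predict([test_img[None], s[None]], verbose=0)[0, 0]
                  for s in support]
        if int(np.argmax(scores)) == true_idx:
            correct += 1
    return correct / len(tasks)
```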

Results

Siamese Neural Network with L fully-connected layers

Accuracy on the Omniglot verification task.

Siamese Neural Network with CNN

Accuracy on the Omniglot verification task.

One-shot Image Recognition

Example of the model's top-5 classification performance on the 1-versus-20 one-shot classification task.

One-shot accuracy on the evaluation set.

Comparison of the best one-shot accuracy from each type of network against the baselines.

Reference: https://blog.csdn.net/bryant_meng/article/details/80087079
Code: https://github.com/sorenbouma/keras-oneshot
