Course 4, week 4 - face recognition and neural style transfer

1 - what is face recognition?

This week covers a couple of important special applications of ConvNets. We will start with face recognition and then go on to neural style transfer.

  • Verification:
    • input: an image and a name/ID
    • output: whether the input image is that of the claimed person
  • Recognition:
    • has a database of K persons
    • gets an input image
    • outputs the ID if the image is of any of the K persons

The recognition problem is much harder than the verification problem. Let's say you have a verification system that is 99% accurate, and suppose that K equals 100 in the recognition system. With 100 persons in the database, you now have 100 chances to make a mistake, so if you want an acceptable recognition error, you might actually need a verification system with 99.9% or even higher accuracy. Next we will focus on building a face verification system as a building block; if its accuracy is high enough, you can then use it in a recognition system.

2 - one-shot learning

One of the challenges of face recognition is that you need to solve the one-shot learning problem. What that means is that, for most face recognition applications, you need to recognize a person given just one single image, but deep learning algorithms don't work well if you have only one training example.

Suppose you have a database with a picture of each of 4 employees in your organization. One traditional approach you could try is to feed the input image into a ConvNet and have it output a label y using a softmax unit with 5 outputs, corresponding to each of the 4 persons plus a "none of the above" case. But this really doesn't work well, because such a small training set is not enough to train a robust neural network for this task. Also, whenever a new person joins the database, so that there are now five persons to recognize, you have to change the architecture of the network and retrain it. That just doesn't seem like a good approach.



So to carry out one-shot learning, what we are going to do instead is learn a similarity function. In particular, you want a neural network to learn a function which takes two images as input and outputs the degree of difference between them, which basically tells you whether they are the same person or different persons.

$d(\text{img1}, \text{img2}) =$ degree of difference between the two images

$$\begin{cases} d(\text{img1}, \text{img2}) \le \tau & \longrightarrow \text{same} \\ d(\text{img1}, \text{img2}) > \tau & \longrightarrow \text{different} \end{cases}$$

So this is how we address the face verification problem.
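As a minimal sketch of this verification rule, assuming the two images have already been mapped to encoding vectors by some encoder (the threshold value here is illustrative):

```python
import numpy as np

def verify(f_img1, f_img2, tau=0.7):
    """Face verification: declare 'same person' iff d(img1, img2) <= tau.
    f_img1 and f_img2 are precomputed encodings; tau is an example value."""
    d = np.sum((f_img1 - f_img2) ** 2)  # squared L2 distance between encodings
    return d <= tau
```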

To use this for a face recognition task, you take a new picture and use the function $d$ to compare it with all pictures in your database, which contains every employee of your team or company. Hopefully $d$ will output a very large number if the two images are not of the same person, and a very small number if they are of the same person. If someone not in your database shows up and you use the function $d$ to make all of the pairwise comparisons, hopefully $d$ will output a very large number for all of them.

So we have addressed the one-shot learning problem, so long as we can learn the function $d$. If someone new joins your team, all you need to do is add a picture of them to your database. A sketch of this recognition loop follows below.
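Here is a sketch of recognition built on top of $d$: compare the new image's encoding against every stored encoding and accept the closest match only if its distance is below the threshold (the function and variable names are illustrative):

```python
import numpy as np

def who_is_it(new_encoding, database, tau=0.7):
    """database maps each person's name to the encoding of their single
    stored picture; returns (name, distance) or (None, distance)."""
    best_name, best_dist = None, float("inf")
    for name, enc in database.items():
        dist = np.sum((new_encoding - enc) ** 2)  # d(new image, stored image)
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist > tau:  # even the closest match is too far: not in database
        return None, best_dist
    return best_name, best_dist
```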

3 - Siamese network

The job of the function $d$ is to take two faces as input and tell you how similar or how different they are. A good way to do this is to use a Siamese network.



You are used to seeing ConvNets like the one above, which end in a feature vector $f(x^{(i)})$. We should think of $f(x^{(1)})$ as an encoding of the input image $x^{(1)}$.

To build a face recognition system, what we can do is feed the two pictures into the same neural network with the same parameters and get two vectors which encode the two images respectively.

If you believe that these encodings are a good representation of the two images, what we can do is define:

$$d(x^{(1)}, x^{(2)}) = \| f(x^{(1)}) - f(x^{(2)}) \|_2^2$$

So this idea of running two identical convolutional neural networks on two different inputs and then comparing their encodings is called the Siamese network architecture.
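To make the shared-parameter idea concrete, here is a toy sketch in which the shared encoder is just a single random linear map; a real system would use a deep ConvNet in its place:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 4096)) * 0.01  # one shared set of parameters

def f(x):
    """Toy encoder: the same W is applied to every input, which is exactly
    what 'Siamese' means: two copies of one network with shared weights."""
    return W @ x

def d(x1, x2):
    return np.sum((f(x1) - f(x2)) ** 2)  # ||f(x1) - f(x2)||^2

x1, x2 = rng.random(4096), rng.random(4096)  # two flattened "images"
print(d(x1, x2))
```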

How do you train this Siamese neural network?

What we want is for the encoding this network computes to make the function $d$ tell us whether two pictures are of the same person.

  • The parameters of the NN define an encoding $f(x^{(i)})$

    • learn the parameters so that:
      • if $x^{(i)}, x^{(j)}$ are of the same person, $\| f(x^{(i)}) - f(x^{(j)}) \|^2$ is small (condition 1)
      • if $x^{(i)}, x^{(j)}$ are of different persons, $\| f(x^{(i)}) - f(x^{(j)}) \|^2$ is large (condition 2)
    • So what we can do is use backpropagation to update all the parameters in order to make sure conditions 1 and 2 are satisfied.

So how do we define an objective function that makes the neural network learn to recognize faces?

4 - triplet loss

One way to learn the parameters of the neural network so that it gives you a good encoding for pictures of faces is to define and apply gradient descent on the triplet loss function.

To learn the parameters of the neural network, you need to look at three pictures at a time: an anchor image A, a positive image P, and a negative image N. You want the encodings of A and P to be similar, because they are of the same person, whereas you want the encodings of A and N to be quite different, because they are of different persons.



To formalize this, you want the parameters, and hence the encoding, to have the following property:

      f(A)f(P)2d(A,P)f(A)f(N)2d(A,N) ‖ f ( A ) − f ( P ) ‖ 2 ⏟ d ( A , P ) ≤ ‖ f ( A ) − f ( N ) ‖ 2 ⏟ d ( A , N )

$$\longrightarrow \| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 \le 0$$

Notice that one trivial way to satisfy this is to learn an encoding that is always zero, that is $f(x) = 0$. To make sure the neural network doesn't just output 0 for every encoding, and also doesn't set all the encodings equal to each other, we modify the objective:

$$\| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 + \alpha \le 0 \tag{1}$$

This prevents the neural network from outputting the trivial solutions. The parameter $\alpha$ is called the margin.

For example, say the margin is set to 0.2 and $d(A, P) = 0.5$. Then we won't be satisfied if $d(A, N)$ is just a little bit bigger, say 0.51; even though 0.51 is better than 0.5, it's not good enough, because we want $d(A, N)$ to be much bigger than $d(A, P)$. So the margin either pushes $d(A, N)$ up or pushes $d(A, P)$ down. That is what the margin parameter does: it pushes the anchor-positive pair and the anchor-negative pair further away from each other.

Define the triplet loss function.

Loss function:

(What we want is for $\| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 + \alpha$ to be less than or equal to 0, so:)

$$L(A, P, N) = \max\left( \| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 + \alpha,\; 0 \right)$$

The effect of taking the max here is that as long as $\| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 + \alpha$ is less than 0, the loss is 0; if on the other hand it is greater than 0, the max ends up selecting $\| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 + \alpha$, and you have a positive loss. So by trying to minimize $L(A, P, N)$, the network tries to push $\| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 + \alpha$ to 0 or below.
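A direct numpy transcription of this loss on one triple of encodings (alpha = 0.2 is just an example value):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """L(A, P, N) = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    d_ap = np.sum((f_a - f_p) ** 2)  # squared distance anchor-positive
    d_an = np.sum((f_a - f_n) ** 2)  # squared distance anchor-negative
    return max(d_ap - d_an + alpha, 0.0)
```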

Cost function:

The overall cost function for the neural network is the sum over the training set of the individual losses on the different triplets:

$$J = \sum_{i=1}^{m} L(A^{(i)}, P^{(i)}, N^{(i)})$$

So for the purposes of training the face recognition system, you do need a dataset where you have multiple images of the same person. But after having trained the system, you can apply it to the one-shot learning problem, where you may have only a single picture of someone you are trying to recognize.

How do you actually choose triplets to form the training set?

If you choose A, P and N randomly from your dataset, subject to A and P being the same person and A and N being different persons, the constraint $d(A, P) + \alpha \le d(A, N)$ is very easy to satisfy, so the neural network won't learn much from it. What we need to do to train a good system is choose triplets that are "hard" to train on, that is, triplets where $d(A, P)$ is actually quite close to $d(A, N)$. In that case, the algorithm has to try extra hard to push $d(A, N)$ up and $d(A, P)$ down, so that there is at least a margin of $\alpha$ between the two sides. A brute-force sketch of this selection rule follows below.
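The sketch keeps only the triplets that still violate the margin under the current encodings; in practice this selection is done per mini-batch, and the nested loops would be vectorized:

```python
import numpy as np

def hard_triplets(enc, labels, alpha=0.2):
    """enc: (m, 128) array of current encodings; labels[i] identifies the
    person in example i. Returns index triples (a, p, n) that are 'hard'."""
    triplets = []
    m = len(labels)
    for a in range(m):
        for p in range(m):
            if p == a or labels[p] != labels[a]:
                continue  # P must be another photo of the same person as A
            d_ap = np.sum((enc[a] - enc[p]) ** 2)
            for n in range(m):
                if labels[n] == labels[a]:
                    continue  # N must be a different person
                d_an = np.sum((enc[a] - enc[n]) ** 2)
                if d_ap + alpha > d_an:  # margin not yet satisfied: hard
                    triplets.append((a, p, n))
    return triplets
```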

FaceNet:

So this is how we can use the triplet loss to train a neural network to output a good encoding for face recognition.

5 - face recognition and binary classification

The triplet loss is one good way to learn the parameters of a ConvNet for face recognition. There is another way to learn these parameters: face recognition can be posed as a binary classification problem.



Another way to train the neural network is to take a pair of networks, have them both compute the embeddings of the two images, and then feed these two embeddings into a logistic regression unit to make a prediction: the output is 1 if the two embeddings are of the same person, and 0 if they are of different persons.

Rather than feeding the two embeddings directly into the logistic unit, what we can do is take the differences between the encodings as the features for the logistic regression:

$$\hat{y} = \sigma\left( \sum_{k=1}^{128} w_k \left| f(x^{(i)})_k - f(x^{(j)})_k \right| + b \right)$$

$$\hat{y} = \sigma\left( \sum_{k=1}^{128} w_k \frac{\left( f(x^{(i)})_k - f(x^{(j)})_k \right)^2}{f(x^{(i)})_k + f(x^{(j)})_k} + b \right)$$

In this architecture the input is a pair of images and the output y is either 0 or 1, and as before, you are training a Siamese network. Notice that instead of having to compute the embeddings of the persons in the database every time you carry out face recognition, you can precompute those embeddings. When an employee walks in, you only need to run the upper ConvNet to get their embedding, compare it against all the precomputed embeddings, and then make a prediction. A sketch of the logistic unit follows below.
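Here is a sketch of the logistic unit on top of two precomputed embeddings, using the absolute-difference features from the first formula above (in a real system, w and b are learned jointly with the ConvNet):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def same_person_prob(f_xi, f_xj, w, b):
    """y_hat = sigma(sum_k w_k * |f(x_i)_k - f(x_j)_k| + b), where f_xi and
    f_xj are the two 128-d embeddings, w has shape (128,), b is a scalar."""
    features = np.abs(f_xi - f_xj)  # element-wise absolute differences
    return sigmoid(w @ features + b)
```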

So when you treat face recognition as binary classification, the training set looks like pairs of images labeled 1 (same person) or 0 (different persons).



6 - what is neural style transfer?

One of the most fun and exciting applications of convolutional neural networks recently has been neural style transfer.



In order to implement neural style transfer, you need to look at the features extracted by a ConvNet at various layers, both the shallow and the deeper ones.

7 - what are deep ConvNets learning?

What is a deep ConvNet really learning? What we want to do is visualize what the hidden units in different layers are computing. Here is how we can do it.

Pick a unit in layer 1, and find the nine image patches that maximize that unit's activation.

Then repeat for other units and do the same thing; here we repeat 9 times.



So these are nine different representative neurons, and for each of them, the nine image patches that maximally activate it.

Units in later layers actually see larger image patches. We repeat the whole procedure for different layers.



So this visualization shows 9 hidden units in each of several layers, and for each of them, the 9 image patches that cause that hidden unit to have a very large activation.

(image patches that maximally activate hidden units in layer 1 through layer 5)

We can see that there is a neuron that seems to be a dog detector in layer 5 (lower right corner), but the set of dogs detected here is more varied than for the dog detector in layer 4 (upper left corner). So we have gone a long way, from detecting relatively simple things such as edges in layer 1 and textures in layer 2, up to detecting very complex objects in the deeper layers, such as flowers, people and dogs. This gives us some better intuition about what the shallow and deeper layers of a neural network are computing. Next, let's use this intuition to build a neural style transfer algorithm.

8 - cost function

To build a neural style transfer system, let's define a cost function for the generated image; what you will see later is that by minimizing this cost function, you can generate the image you want.

The problem formulation is: given a content image C and a style image S, our goal is to generate a new image G. So in order to implement neural style transfer, what we are going to do is define a cost function $J(G)$ that measures how good a generated image is, and we will use gradient descent to minimize $J(G)$ in order to generate the image.

We will define two parts of the cost function. The first part is the content cost $J_{content}(C, G)$, which measures how similar the content of the generated image G is to the content of the content image C. The second part is the style cost $J_{style}(S, G)$, which measures how similar the style of G is to the style of S.

$$J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$$

The way the algorithm runs is as follows (a sketch of the update loop is given after this list):

• initialize G randomly:
  • G: 100 × 100 × 3
• use gradient descent to minimize $J(G)$:
  • $G := G - \alpha \frac{\partial J(G)}{\partial G}$
    so we are updating the pixel values of the image G
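Here is a sketch of that loop, assuming a helper `cost_and_grad` that returns $J(G)$ and its gradient with respect to the pixels; in practice both would come from backprop through a pre-trained ConvNet such as VGG, which is not implemented here:

```python
import numpy as np

def generate(cost_and_grad, steps=200, lr=2.0):
    """Gradient descent directly on the pixels of G. cost_and_grad is an
    assumed helper returning (J(G), dJ/dG); steps and lr are illustrative."""
    G = np.random.uniform(0, 255, size=(100, 100, 3))  # initialize G randomly
    for _ in range(steps):
        J, dG = cost_and_grad(G)
        G -= lr * dG  # update the pixel values of G
    return G
```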

example:

Given a content image C and a style image S, as the algorithm runs, the generated image G gradually combines the content of C with the style of S.

This is the overall outline of the neural style transfer algorithm: first define a cost function for the generated image G, then minimize it.

9 - content cost function

The overall cost function of the neural style transfer algorithm is as follows:

$$J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$$

So let's figure out what the content cost function should be.

• say you use hidden layer $l$ to compute the content cost
  • if $l$ is a very small number, it would really force G to have pixel values very similar to C; whereas if you use a very deep layer, it would only make sure that, say, there is a dog somewhere in the generated image G if there is a dog in the content image C. In practice, we choose a layer in the middle of the network, neither too shallow nor too deep.
• use a pre-trained ConvNet (e.g. VGG)
• let $a^{[l](C)}$ and $a^{[l](G)}$ be the activations of layer $l$ on the images C and G
• if $a^{[l](C)}$ and $a^{[l](G)}$ are similar, both images have similar content
• so we define the content cost as follows (a sketch is given after this list):

  $$J_{content}(C, G) = \frac{1}{2} \| a^{[l](C)} - a^{[l](G)} \|^2$$

  It is really just the element-wise sum of squared differences between the two activations in layer $l$ on the images C and G.
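In numpy, the content cost for one chosen layer is simply (a minimal sketch; a_C and a_G are assumed to come from forward passes of the pre-trained network on C and G):

```python
import numpy as np

def content_cost(a_C, a_G):
    """J_content(C, G) = 1/2 * sum of squared element-wise differences
    between the layer-l activation volumes on C and on G (same shape)."""
    return 0.5 * np.sum((a_G - a_C) ** 2)
```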

10 - style cost function

What does the style of an image mean? Let's say you are using layer $l$'s activations to measure "style". We are going to define the style as the correlation between activations across different channels in layer $l$.

Why does this capture style?

Let's say the red channel corresponds to a neuron (or kernel/filter) trying to detect a little vertical texture, and the yellow channel corresponds to a neuron looking for orange-colored patches. If those two channels are highly correlated, it means that whatever part of the image has this type of vertical texture will probably also have the orange color. And what does it mean for them to be uncorrelated? It means that wherever there is this vertical texture, it probably won't be orange. So the degree of this type of correlation gives you a way of measuring how often these different high-level features occur together and how often they don't.

Using this measure, we can tell how similar the style of G is to the style of the style image S.

Style matrix

For the style image and the generated image, we can compute a style matrix.

Let $a^{[l]}_{i,j,k}$ be the activation at position $(i, j, k)$. $G^{[l](S)}$ and $G^{[l](G)}$ have shape $(n_c^{[l]}, n_c^{[l]})$:

$$G^{[l](S)}_{kk'} = \sum_{i=1}^{n_H^{[l]}} \sum_{j=1}^{n_W^{[l]}} a^{[l](S)}_{i,j,k} \, a^{[l](S)}_{i,j,k'}$$

$$G^{[l](G)}_{kk'} = \sum_{i=1}^{n_H^{[l]}} \sum_{j=1}^{n_W^{[l]}} a^{[l](G)}_{i,j,k} \, a^{[l](G)}_{i,j,k'}$$

$G^{[l](S)}_{kk'}$ and $G^{[l](G)}_{kk'}$ are the style matrices of S and G, respectively.

The style cost function on layer $l$ between S and G is as follows (a sketch in numpy is given below):

$$J_{style}^{[l]}(S, G) = \frac{1}{\left( 2 n_H^{[l]} n_W^{[l]} n_C^{[l]} \right)^2} \| G^{[l](S)} - G^{[l](G)} \|_F^2 = \frac{1}{\left( 2 n_H^{[l]} n_W^{[l]} n_C^{[l]} \right)^2} \sum_k \sum_{k'} \left( G^{[l](S)}_{kk'} - G^{[l](G)}_{kk'} \right)^2$$
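Here is a sketch of the style matrix and the per-layer style cost, assuming the activations are laid out as (n_H, n_W, n_C) numpy arrays:

```python
import numpy as np

def gram_matrix(a):
    """Style matrix G[k, k'] = sum over i, j of a[i, j, k] * a[i, j, k']."""
    n_H, n_W, n_C = a.shape
    A = a.reshape(n_H * n_W, n_C)  # rows are positions, columns are channels
    return A.T @ A                 # (n_C, n_C) channel correlations

def layer_style_cost(a_S, a_G):
    """J_style^[l](S, G) for one layer, with the normalization constant."""
    n_H, n_W, n_C = a_S.shape
    scale = (2.0 * n_H * n_W * n_C) ** 2
    return np.sum((gram_matrix(a_S) - gram_matrix(a_G)) ** 2) / scale
```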

It turns out that you get more visually pleasing results if you use the style cost function from multiple different layers:

$$J_{style}(S, G) = \sum_l \lambda^{[l]} J_{style}^{[l]}(S, G)$$

Now we can define the overall cost function:

$$J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$$

And then use gradient descent, or a more sophisticated optimization algorithm, to find the optimal G that minimizes $J(G)$.

11 - 1D and 3D generalizations of models

We have learned about ConvNets, ranging from their architecture to how to use them for image recognition, object detection, face recognition and neural style transfer. It turns out that many of these ideas apply not just to 2D images but also to 1D data as well as 3D data.

convolutions in 2D and 1D

for the 2D image:

• a $14 \times 14 \times 3$ input convolved with 16 filters of shape $5 \times 5 \times 3$ $\longrightarrow$ $10 \times 10 \times 16$

• $10 \times 10 \times 16$ convolved with 32 filters of shape $5 \times 5 \times 16$ $\longrightarrow$ $6 \times 6 \times 32$

for the 1D series:

• a $14 \times 1$ input convolved with 16 filters of shape $5 \times 1$ $\longrightarrow$ $10 \times 16$
• $10 \times 16$ convolved with 32 filters of shape $5 \times 16$ $\longrightarrow$ $6 \times 32$

This is how we generalize from 2D data to 1D data; how about 3D data?



for a 3D volume:

• a $14 \times 14 \times 14 \times 1$ input convolved with 16 filters of shape $5 \times 5 \times 5 \times 1$ $\longrightarrow$ $10 \times 10 \times 10 \times 16$
• $10 \times 10 \times 10 \times 16$ convolved with 32 filters of shape $5 \times 5 \times 5 \times 16$ $\longrightarrow$ $6 \times 6 \times 6 \times 32$
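The shape arithmetic is the same in all three cases: with a valid convolution and stride 1, each spatial dimension n becomes n - f + 1, and the channel dimension becomes the number of filters. A small helper to check the numbers above:

```python
def conv_output_shape(in_spatial, f, n_filters):
    """Valid convolution, stride 1: each spatial dim n becomes n - f + 1."""
    return tuple(n - f + 1 for n in in_spatial) + (n_filters,)

print(conv_output_shape((14, 14), 5, 16))      # (10, 10, 16)     2D
print(conv_output_shape((14,), 5, 16))         # (10, 16)         1D
print(conv_output_shape((14, 14, 14), 5, 16))  # (10, 10, 10, 16) 3D
```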