Course 4, week 4 - face recognition and neural style transfer

1 - what is face recognition?

This week covers a couple of important special applications of ConvNets. We will start with face recognition and then go on to neural style transfer.

  • Verification:
    • input: an image and a name/ID
    • output: whether the input image is that of the claimed person
  • Recognition:
    • has a database of K persons
    • gets an input image
    • outputs the ID if the image is of any of the K persons

The recognition problem is much harder than the verification problem. Let's say you have a verification system that is 99% accurate, and suppose that K equals 100 in the recognition system. With 100 persons in the database, you now have 100 chances to make a mistake, so if you want an acceptable recognition error, you might actually need a verification system with 99.9% or even higher accuracy. Next we will focus on building a face verification system as a building block; if its accuracy is high enough, you can then use it in a recognition system.

2 - one-shot learning

One of the challenges of face recognition is that you need to solve the one-shot learning problem. What that means is that, for most face recognition applications, you need to recognize a person given just one single image, but deep learning algorithms don't work well if you have only one training example.

Suppose you have a database with a picture of each of 4 employees in your organization. One traditional approach you could try is to feed the input image into a ConvNet and have it output a label y using a softmax unit with 5 outputs, corresponding to each of the 4 persons plus a "none of the above" case. But this really doesn't work well, because such a small training set is not enough to train a robust neural network for this task. Also, whenever a new person joins the database, so that there are now five persons to recognize, you have to change the architecture of the network and retrain it. That just doesn't seem like a good approach.



So to carry out one-shot learning, what we are going to do instead is learn a similarity function. In particular, you want a neural network to learn a function which takes two images as input and outputs the degree of difference between them, which basically tells you whether they are the same person or different persons.

$d(\text{img1}, \text{img2}) =$ degree of difference between the two images

$$\begin{cases} d(\text{img1}, \text{img2}) \le \tau & \longrightarrow \text{same} \\ d(\text{img1}, \text{img2}) > \tau & \longrightarrow \text{different} \end{cases}$$

So this is how we address the face verification problem.
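As a minimal sketch of this verification rule, assuming the two images have already been mapped to encoding vectors by some encoder (the threshold value here is illustrative):

```python
import numpy as np

def verify(f_img1, f_img2, tau=0.7):
    """Face verification: declare 'same person' iff d(img1, img2) <= tau.
    f_img1 and f_img2 are precomputed encodings; tau is an example value."""
    d = np.sum((f_img1 - f_img2) ** 2)  # squared L2 distance between encodings
    return d <= tau
```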

To use this for a face recognition task, you take a new picture and use the function $d$ to compare it with all pictures in your database, which contains every employee of your team or company. Hopefully $d$ will output a very large number if the two images are not of the same person, and a very small number if they are of the same person. If someone not in your database shows up and you use the function $d$ to make all of the pairwise comparisons, hopefully $d$ will output a very large number for all of them.

So we have addressed the one-shot learning problem, so long as we can learn the function $d$. If someone new joins your team, all you need to do is add a picture of them to your database. A sketch of this recognition loop follows below.
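Here is a sketch of recognition built on top of $d$: compare the new image's encoding against every stored encoding and accept the closest match only if its distance is below the threshold (the function and variable names are illustrative):

```python
import numpy as np

def who_is_it(new_encoding, database, tau=0.7):
    """database maps each person's name to the encoding of their single
    stored picture; returns (name, distance) or (None, distance)."""
    best_name, best_dist = None, float("inf")
    for name, enc in database.items():
        dist = np.sum((new_encoding - enc) ** 2)  # d(new image, stored image)
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist > tau:  # even the closest match is too far: not in database
        return None, best_dist
    return best_name, best_dist
```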

3 - Siamese network

The job of the function $d$ is to take two faces as input and tell you how similar or how different they are. A good way to do this is to use a Siamese network.



You are used to seeing ConvNets like the one above, which end in a feature vector $f(x^{(i)})$. We should think of $f(x^{(1)})$ as an encoding of the input image $x^{(1)}$.

To build a face recognition system, what we can do is feed the two pictures into the same neural network with the same parameters and get two vectors which encode the two images respectively.

If you believe that these encodings are a good representation of the two images, what we can do is define:

$$d(x^{(1)}, x^{(2)}) = \| f(x^{(1)}) - f(x^{(2)}) \|_2^2$$

So this idea of running two identical convolutional neural networks on two different inputs and then comparing their encodings is called the Siamese network architecture.
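To make the shared-parameter idea concrete, here is a toy sketch in which the shared encoder is just a single random linear map; a real system would use a deep ConvNet in its place:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 4096)) * 0.01  # one shared set of parameters

def f(x):
    """Toy encoder: the same W is applied to every input, which is exactly
    what 'Siamese' means: two copies of one network with shared weights."""
    return W @ x

def d(x1, x2):
    return np.sum((f(x1) - f(x2)) ** 2)  # ||f(x1) - f(x2)||^2

x1, x2 = rng.random(4096), rng.random(4096)  # two flattened "images"
print(d(x1, x2))
```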

How do you train this Siamese neural network?

What we want is for the encoding this network computes to make the function $d$ tell us whether two pictures are of the same person.

  • The parameters of the NN define an encoding $f(x^{(i)})$

    • learn the parameters so that:
      • if $x^{(i)}, x^{(j)}$ are of the same person, $\| f(x^{(i)}) - f(x^{(j)}) \|^2$ is small (condition 1)
      • if $x^{(i)}, x^{(j)}$ are of different persons, $\| f(x^{(i)}) - f(x^{(j)}) \|^2$ is large (condition 2)
    • So what we can do is use backpropagation to update all the parameters in order to make sure conditions 1 and 2 are satisfied.

So how do we define an objective function that makes the neural network learn to recognize faces?

4 - triplet loss

One way to learn the parameters of the neural network so that it gives you a good encoding for pictures of faces is to define and apply gradient descent on the triplet loss function.

To learn the parameters of the neural network, you need to look at three pictures at a time: an anchor image A, a positive image P, and a negative image N. You want the encodings of A and P to be similar, because they are of the same person, whereas you want the encodings of A and N to be quite different, because they are of different persons.



To formalize this, you want the parameters, and hence the encoding, to have the following property:

      f(A)f(P)2d(A,P)f(A)f(N)2d(A,N) ‖ f ( A ) − f ( P ) ‖ 2 ⏟ d ( A , P ) ≤ ‖ f ( A ) − f ( N ) ‖ 2 ⏟ d ( A , N )

$$\longrightarrow \| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 \le 0$$

Notice that one trivial way to satisfy this is to learn an encoding that is always zero, that is $f(x) = 0$. To make sure the neural network doesn't just output 0 for every encoding, and also doesn't set all the encodings equal to each other, we modify the objective:

$$\| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 + \alpha \le 0 \tag{1}$$

This prevents the neural network from outputting the trivial solutions. The parameter $\alpha$ is called the margin.

For example, say the margin is set to 0.2 and $d(A, P) = 0.5$. Then we won't be satisfied if $d(A, N)$ is just a little bit bigger, say 0.51; even though 0.51 is better than 0.5, it's not good enough, because we want $d(A, N)$ to be much bigger than $d(A, P)$. So the margin either pushes $d(A, N)$ up or pushes $d(A, P)$ down. That is what the margin parameter does: it pushes the anchor-positive pair and the anchor-negative pair further away from each other.

Define the triplet loss function.

Loss function:

(What we want is for $\| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 + \alpha$ to be less than or equal to 0, so:)

$$L(A, P, N) = \max\left( \| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 + \alpha,\; 0 \right)$$

The effect of taking the max here is that as long as $\| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 + \alpha$ is less than 0, the loss is 0; if on the other hand it is greater than 0, the max ends up selecting $\| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 + \alpha$, and you have a positive loss. So by trying to minimize $L(A, P, N)$, the network tries to push $\| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 + \alpha$ to 0 or below.
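A direct numpy transcription of this loss on one triple of encodings (alpha = 0.2 is just an example value):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """L(A, P, N) = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    d_ap = np.sum((f_a - f_p) ** 2)  # squared distance anchor-positive
    d_an = np.sum((f_a - f_n) ** 2)  # squared distance anchor-negative
    return max(d_ap - d_an + alpha, 0.0)
```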

Cost function:

The overall cost function for the neural network is the sum over the training set of the individual losses on the different triplets:

$$J = \sum_{i=1}^{m} L(A^{(i)}, P^{(i)}, N^{(i)})$$

So for the purposes of training the face recognition system, you do need a dataset where you have multiple images of the same person. But after having trained the system, you can apply it to the one-shot learning problem, where you may have only a single picture of someone you are trying to recognize.

How do you actually choose triplets to form the training set?

If you choose A, P and N randomly from your dataset, subject to A and P being the same person and A and N being different persons, the constraint $d(A, P) + \alpha \le d(A, N)$ is very easy to satisfy, so the neural network won't learn much from it. What we need to do to train a good system is choose triplets that are "hard" to train on, that is, triplets where $d(A, P)$ is actually quite close to $d(A, N)$. In that case, the algorithm has to try extra hard to push $d(A, N)$ up and $d(A, P)$ down, so that there is at least a margin of $\alpha$ between the two sides. A brute-force sketch of this selection rule follows below.
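The sketch keeps only the triplets that still violate the margin under the current encodings; in practice this selection is done per mini-batch, and the nested loops would be vectorized:

```python
import numpy as np

def hard_triplets(enc, labels, alpha=0.2):
    """enc: (m, 128) array of current encodings; labels[i] identifies the
    person in example i. Returns index triples (a, p, n) that are 'hard'."""
    triplets = []
    m = len(labels)
    for a in range(m):
        for p in range(m):
            if p == a or labels[p] != labels[a]:
                continue  # P must be another photo of the same person as A
            d_ap = np.sum((enc[a] - enc[p]) ** 2)
            for n in range(m):
                if labels[n] == labels[a]:
                    continue  # N must be a different person
                d_an = np.sum((enc[a] - enc[n]) ** 2)
                if d_ap + alpha > d_an:  # margin not yet satisfied: hard
                    triplets.append((a, p, n))
    return triplets
```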

FaceNet:

So this is how we can use the triplet loss to train a neural network to output a good encoding for face recognition.

5 - face recognition and binary classification

The triplet loss is one good way to learn the parameters of a ConvNet for face recognition. There is another way to learn these parameters: face recognition can be posed as a binary classification problem.



Another way to train the neural network is to take a pair of networks, have them both compute the embeddings of the two images, and then feed these two embeddings into a logistic regression unit to make a prediction: the output is 1 if the two embeddings are of the same person, and 0 if they are of different persons.

Rather than feeding the two embeddings directly into the logistic unit, what we can do is take the differences between the encodings as the features for the logistic regression:

$$\hat{y} = \sigma\left( \sum_{k=1}^{128} w_k \left| f(x^{(i)})_k - f(x^{(j)})_k \right| + b \right)$$

$$\hat{y} = \sigma\left( \sum_{k=1}^{128} w_k \frac{\left( f(x^{(i)})_k - f(x^{(j)})_k \right)^2}{f(x^{(i)})_k + f(x^{(j)})_k} + b \right)$$

In this architecture the input is a pair of images and the output y is either 0 or 1, and as before, you are training a Siamese network. Notice that instead of having to compute the embeddings of the persons in the database every time you carry out face recognition, you can precompute those embeddings. When an employee walks in, you only need to run the upper ConvNet to get their embedding, compare it against all the precomputed embeddings, and then make a prediction. A sketch of the logistic unit follows below.
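Here is a sketch of the logistic unit on top of two precomputed embeddings, using the absolute-difference features from the first formula above (in a real system, w and b are learned jointly with the ConvNet):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def same_person_prob(f_xi, f_xj, w, b):
    """y_hat = sigma(sum_k w_k * |f(x_i)_k - f(x_j)_k| + b), where f_xi and
    f_xj are the two 128-d embeddings, w has shape (128,), b is a scalar."""
    features = np.abs(f_xi - f_xj)  # element-wise absolute differences
    return sigmoid(w @ features + b)
```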

So when you treat face recognition as binary classification, the training set looks like pairs of images labeled 1 (same person) or 0 (different persons).



6 - what is neural style transfer?

One of the most fun and exciting applications of convolutional neural networks recently has been neural style transfer.



In order to implement neural style transfer, you need to look at the features extracted by a ConvNet at various layers, both the shallow and the deeper ones.

7 - what are deep ConvNets learning?

What is a deep ConvNet really learning? What we want to do is visualize what the hidden units in different layers are computing. Here is how we can do it.

Pick a unit in layer 1, and find the nine image patches that maximize that unit's activation.

Then repeat for other units and do the same thing; here we repeat 9 times.



So these are nine different representative neurons, and for each of them, the nine image patches that maximally activate it.

Units in later layers actually see larger image patches. We repeat the whole procedure for different layers.



So this visualization shows 9 hidden units in each of several layers, and for each of them, the 9 image patches that cause that hidden unit to have a very large activation.

(image patches that maximally activate hidden units in layer 1 through layer 5)

We can see that there is a neuron that seems to be a dog detector in layer 5 (lower right corner), but the set of dogs detected here is more varied than for the dog detector in layer 4 (upper left corner). So we have gone a long way, from detecting relatively simple things such as edges in layer 1 and textures in layer 2, up to detecting very complex objects in the deeper layers, such as flowers, people and dogs. This gives us some better intuition about what the shallow and deeper layers of a neural network are computing. Next, let's use this intuition to build a neural style transfer algorithm.

8 - cost function

To build a neural style transfer system, let's define a cost function for the generated image; what you will see later is that by minimizing this cost function, you can generate the image you want.

The problem formulation is: given a content image C and a style image S, our goal is to generate a new image G. So in order to implement neural style transfer, what we are going to do is define a cost function $J(G)$ that measures how good a generated image is, and we will use gradient descent to minimize $J(G)$ in order to generate the image.

We will define two parts of the cost function. The first part is the content cost $J_{content}(C, G)$, which measures how similar the content of the generated image G is to the content of the content image C. The second part is the style cost $J_{style}(S, G)$, which measures how similar the style of G is to the style of S.

$$J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$$

The way the algorithm runs is as follows (a sketch of the update loop is given after this list):

• initialize G randomly:
  • G: 100 × 100 × 3
• use gradient descent to minimize $J(G)$:
  • $G := G - \alpha \frac{\partial J(G)}{\partial G}$
    so we are updating the pixel values of the image G
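Here is a sketch of that loop, assuming a helper `cost_and_grad` that returns $J(G)$ and its gradient with respect to the pixels; in practice both would come from backprop through a pre-trained ConvNet such as VGG, which is not implemented here:

```python
import numpy as np

def generate(cost_and_grad, steps=200, lr=2.0):
    """Gradient descent directly on the pixels of G. cost_and_grad is an
    assumed helper returning (J(G), dJ/dG); steps and lr are illustrative."""
    G = np.random.uniform(0, 255, size=(100, 100, 3))  # initialize G randomly
    for _ in range(steps):
        J, dG = cost_and_grad(G)
        G -= lr * dG  # update the pixel values of G
    return G
```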

example:

Given a content image C and a style image S, as the algorithm runs, the generated image G gradually combines the content of C with the style of S.

This is the overall outline of the neural style transfer algorithm: first define a cost function for the generated image G, then minimize it.

9 - content cost function

The overall cost function of the neural style transfer algorithm is as follows:

$$J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$$

So let's figure out what the content cost function should be.

• say you use hidden layer $l$ to compute the content cost
  • if $l$ is a very small number, it would really force G to have pixel values very similar to C; whereas if you use a very deep layer, it would only make sure that, say, there is a dog somewhere in the generated image G if there is a dog in the content image C. In practice, we choose a layer in the middle of the network, neither too shallow nor too deep.
• use a pre-trained ConvNet (e.g. VGG)
• let $a^{[l](C)}$ and $a^{[l](G)}$ be the activations of layer $l$ on the images C and G
• if $a^{[l](C)}$ and $a^{[l](G)}$ are similar, both images have similar content
• so we define the content cost as follows (a sketch is given after this list):

  $$J_{content}(C, G) = \frac{1}{2} \| a^{[l](C)} - a^{[l](G)} \|^2$$

  It is really just the element-wise sum of squared differences between the two activations in layer $l$ on the images C and G.
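In numpy, the content cost for one chosen layer is simply (a minimal sketch; a_C and a_G are assumed to come from forward passes of the pre-trained network on C and G):

```python
import numpy as np

def content_cost(a_C, a_G):
    """J_content(C, G) = 1/2 * sum of squared element-wise differences
    between the layer-l activation volumes on C and on G (same shape)."""
    return 0.5 * np.sum((a_G - a_C) ** 2)
```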

10 - style cost function

What does the style of an image mean? Let's say you are using layer $l$'s activations to measure "style". We are going to define the style as the correlation between activations across different channels in layer $l$.

Why does this capture style?

Let's say the red channel corresponds to a neuron (or kernel/filter) trying to detect a little vertical texture, and the yellow channel corresponds to a neuron looking for orange-colored patches. If those two channels are highly correlated, it means that whatever part of the image has this type of vertical texture will probably also have the orange color. And what does it mean for them to be uncorrelated? It means that wherever there is this vertical texture, it probably won't be orange. So the degree of this type of correlation gives you a way of measuring how often these different high-level features occur together and how often they don't.

Using this measure, we can tell how similar the style of G is to the style of the style image S.

Style matrix

For the style image and the generated image, we can compute a style matrix.

Let $a^{[l]}_{i,j,k}$ be the activation at position $(i, j, k)$. $G^{[l](S)}$ and $G^{[l](G)}$ have shape $(n_c^{[l]}, n_c^{[l]})$:

$$G^{[l](S)}_{kk'} = \sum_{i=1}^{n_H^{[l]}} \sum_{j=1}^{n_W^{[l]}} a^{[l](S)}_{i,j,k} \, a^{[l](S)}_{i,j,k'}$$

$$G^{[l](G)}_{kk'} = \sum_{i=1}^{n_H^{[l]}} \sum_{j=1}^{n_W^{[l]}} a^{[l](G)}_{i,j,k} \, a^{[l](G)}_{i,j,k'}$$

$G^{[l](S)}_{kk'}$ and $G^{[l](G)}_{kk'}$ are the style matrices of S and G, respectively.

The style cost function on layer $l$ between S and G is as follows (a sketch in numpy is given below):

$$J_{style}^{[l]}(S, G) = \frac{1}{\left( 2 n_H^{[l]} n_W^{[l]} n_C^{[l]} \right)^2} \| G^{[l](S)} - G^{[l](G)} \|_F^2 = \frac{1}{\left( 2 n_H^{[l]} n_W^{[l]} n_C^{[l]} \right)^2} \sum_k \sum_{k'} \left( G^{[l](S)}_{kk'} - G^{[l](G)}_{kk'} \right)^2$$
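Here is a sketch of the style matrix and the per-layer style cost, assuming the activations are laid out as (n_H, n_W, n_C) numpy arrays:

```python
import numpy as np

def gram_matrix(a):
    """Style matrix G[k, k'] = sum over i, j of a[i, j, k] * a[i, j, k']."""
    n_H, n_W, n_C = a.shape
    A = a.reshape(n_H * n_W, n_C)  # rows are positions, columns are channels
    return A.T @ A                 # (n_C, n_C) channel correlations

def layer_style_cost(a_S, a_G):
    """J_style^[l](S, G) for one layer, with the normalization constant."""
    n_H, n_W, n_C = a_S.shape
    scale = (2.0 * n_H * n_W * n_C) ** 2
    return np.sum((gram_matrix(a_S) - gram_matrix(a_G)) ** 2) / scale
```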

It turns out that you get more visually pleasing results if you use the style cost function from multiple different layers:

$$J_{style}(S, G) = \sum_l \lambda^{[l]} J_{style}^{[l]}(S, G)$$

Now we can define the overall cost function:

$$J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$$

And then use gradient descent, or a more sophisticated optimization algorithm, to find the optimal G that minimizes $J(G)$.

11 - 1D and 3D generalizations of models

We have learned about ConvNets, ranging from their architecture to how to use them for image recognition, object detection, face recognition and neural style transfer. It turns out that many of these ideas apply not just to 2D images but also to 1D data as well as 3D data.

convolutions in 2D and 1D

for the 2D image:

• a $14 \times 14 \times 3$ input convolved with 16 filters of shape $5 \times 5 \times 3$ $\longrightarrow$ $10 \times 10 \times 16$

• $10 \times 10 \times 16$ convolved with 32 filters of shape $5 \times 5 \times 16$ $\longrightarrow$ $6 \times 6 \times 32$

for the 1D series:

• a $14 \times 1$ input convolved with 16 filters of shape $5 \times 1$ $\longrightarrow$ $10 \times 16$
• $10 \times 16$ convolved with 32 filters of shape $5 \times 16$ $\longrightarrow$ $6 \times 32$

This is how we generalize from 2D data to 1D data; how about 3D data?



for a 3D volume:

• a $14 \times 14 \times 14 \times 1$ input convolved with 16 filters of shape $5 \times 5 \times 5 \times 1$ $\longrightarrow$ $10 \times 10 \times 10 \times 16$
• $10 \times 10 \times 10 \times 16$ convolved with 32 filters of shape $5 \times 5 \times 5 \times 16$ $\longrightarrow$ $6 \times 6 \times 6 \times 32$
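The shape arithmetic is the same in all three cases: with a valid convolution and stride 1, each spatial dimension n becomes n - f + 1, and the channel dimension becomes the number of filters. A small helper to check the numbers above:

```python
def conv_output_shape(in_spatial, f, n_filters):
    """Valid convolution, stride 1: each spatial dim n becomes n - f + 1."""
    return tuple(n - f + 1 for n in in_spatial) + (n_filters,)

print(conv_output_shape((14, 14), 5, 16))      # (10, 10, 16)     2D
print(conv_output_shape((14,), 5, 16))         # (10, 16)         1D
print(conv_output_shape((14, 14, 14), 5, 16))  # (10, 10, 10, 16) 3D
```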