Course 4, Week 4: Face Recognition & Neural Style Transfer

Face Recognition

Face Verification vs. Face Recognition

  • Verification (1:1)
    • Input: an image and a name/ID
    • Output: whether the input image is that of the claimed person
  • Recognition (1:n)
    • Has a database of n persons
    • Gets an input image
    • Outputs the ID if the image is that of any of the n persons

Siamese Network

[Figure: Siamese network encoding two face images]

A Siamese network aims to learn a function that takes a face image $x^{(i)}$ as input and outputs a vector $f(x^{(i)})$ that encodes the face. The parameters are then learned so that:

  • If $x^{(i)}$ and $x^{(j)}$ are the same person, $\left\| f(x^{(i)}) - f(x^{(j)}) \right\|$ is small.
  • If $x^{(i)}$ and $x^{(j)}$ are different persons, $\left\| f(x^{(i)}) - f(x^{(j)}) \right\|$ is large.
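
As a minimal sketch of this idea (the tiny encoder below is only a placeholder, not the actual deep model used in practice; the 96×96×3 input size, 128-dimensional encoding, and 0.7 threshold are illustrative assumptions):

```python
import numpy as np
import tensorflow as tf

# Placeholder encoder: maps a 96x96x3 face image to a 128-d, L2-normalized encoding f(x).
encoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(96, 96, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128),
    tf.keras.layers.Lambda(lambda v: tf.math.l2_normalize(v, axis=-1)),
])

def verify(img_i, img_j, threshold=0.7):
    """Small ||f(x_i) - f(x_j)|| means 'same person'; the threshold is a tunable assumption."""
    f_i = encoder(img_i[np.newaxis])          # shape (1, 128)
    f_j = encoder(img_j[np.newaxis])
    distance = float(tf.norm(f_i - f_j, axis=-1)[0])
    return distance < threshold, distance
```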

Triplet Loss

One way to learn parameters that give a good encoding of faces is to define the triplet loss function and apply gradient descent to it.

This is what gives rise to the term triplet loss: we always use three images at a time (Anchor, Positive, Negative). Positive means the image shows the same person as the Anchor, and Negative means it shows a different person.

[Figure: Anchor, Positive, and Negative example images]
We want the distance between the Anchor and the Positive to be small, and the distance between the Anchor and the Negative to be larger. That means:

$$\left\| f(A) - f(P) \right\|^{2} \leq \left\| f(A) - f(N) \right\|^{2}$$

Then:

$$\left\| f(A) - f(P) \right\|^{2} - \left\| f(A) - f(N) \right\|^{2} \leq 0$$

To prevent the network from satisfying this trivially (for example, by outputting the same encoding, such as all zeros, for every input), we make a slight change to this expression by adding a margin:

$$\left\| f(A) - f(P) \right\|^{2} - \left\| f(A) - f(N) \right\|^{2} + \alpha \leq 0$$

$\alpha$ is also called the margin, a term you may have encountered in the literature on Support Vector Machines.

Finally, the loss function is:

$$L(A, P, N) = \max\left( \left\| f(A) - f(P) \right\|^{2} - \left\| f(A) - f(N) \right\|^{2} + \alpha,\ 0 \right)$$

Note that if A and N are two randomly chosen different persons, the constraint is too easy to satisfy and the network learns little. It is better to choose triplets (A, P, N) for which $\left\| f(A) - f(P) \right\|^{2} \approx \left\| f(A) - f(N) \right\|^{2}$, so the network has to work harder to learn good face encodings.
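
A minimal sketch of the triplet loss over a batch of encodings (the margin value 0.2 is just a common illustrative choice):

```python
import tensorflow as tf

def triplet_loss(f_anchor, f_positive, f_negative, alpha=0.2):
    """L(A, P, N) = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0), averaged over the batch.
    Each argument has shape (batch, encoding_dim)."""
    pos_dist = tf.reduce_sum(tf.square(f_anchor - f_positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(f_anchor - f_negative), axis=-1)
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + alpha, 0.0))
```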

Face Verification and Binary Classification

We can also treat face verification as a binary classification problem and let the network itself learn to tell whether two images show the same person, using the Siamese architecture shown in the image below:

[Figure: Siamese network with a logistic output unit for verification]

Note that the two pipelines in the network share the same weights. As the output layer shows, we can adopt the formula below:

$$\hat{y} = \sigma\left( \sum_{k=1}^{n} w_{k} \left| f(x^{(i)})_{k} - f(x^{(j)})_{k} \right| + b \right)$$

or

$$\hat{y} = \sigma\left( \sum_{k=1}^{n} w_{k} \frac{\left( f(x^{(i)})_{k} - f(x^{(j)})_{k} \right)^{2}}{f(x^{(i)})_{k} + f(x^{(j)})_{k}} + b \right)$$

The latter is also called the $\chi^{2}$ (chi-squared) similarity formula.
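
A sketch of this verification head in Keras (the 128-dimensional encoding size is an assumption; the Dense(1, sigmoid) layer plays the role of $\sigma(\sum_k w_k \cdot \text{feature}_k + b)$, and the two encodings are assumed to come from a shared, precomputed Siamese encoder):

```python
import tensorflow as tf

def pairwise_features(f_i, f_j, use_chi_squared=False):
    """Element-wise features from two encodings: |f_i - f_j| or the chi-squared variant."""
    if use_chi_squared:
        # chi-squared feature; assumes non-negative encodings, epsilon added for stability
        return tf.square(f_i - f_j) / (f_i + f_j + 1e-8)
    return tf.abs(f_i - f_j)

enc_i = tf.keras.Input(shape=(128,))   # encoding of image i
enc_j = tf.keras.Input(shape=(128,))   # encoding of image j
feats = tf.keras.layers.Lambda(lambda p: pairwise_features(p[0], p[1]))([enc_i, enc_j])
y_hat = tf.keras.layers.Dense(1, activation="sigmoid")(feats)  # sigma(sum_k w_k * feat_k + b)
verifier = tf.keras.Model(inputs=[enc_i, enc_j], outputs=y_hat)
```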

Reference: Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, Lior Wolf (2014). DeepFace: Closing the Gap to Human-Level Performance in Face Verification.


Neural Style Transfer

What is Neural Style Transfer

Neural style transfer is an optimization technique used to take three images, a content image, a style reference image (such as an artwork by a famous painter), and the input image you want to style — and blend them together such that the input image is transformed to look like the content image, but “painted” in the style of the style image.

[Figure: content image + style image → generated image]

The principle of neural style transfer is to define two distance functions: one that describes how different the content of two images is, $L_{content}$, and one that describes the difference between two images in terms of their style, $L_{style}$. Then, given three images (a desired style image, a desired content image, and the input image, initialized with the content image), we transform the input image to minimize its content distance to the content image and its style distance to the style image.

In summary, we’ll take the base input image, a content image that we want to match, and the style image that we want to match. We’ll transform the base input image by minimizing the content and style distances (losses) with backpropagation, creating an image that matches the content of the content image and the style of the style image.

$$L(G) = \alpha L_{content}(C, G) + \beta L_{style}(S, G)$$

In order to get both the content and style representations of our image, we will look at some intermediate layers within our model. Intermediate layers represent feature maps that become increasingly higher-order as you go deeper. In this case, we use the VGG19 architecture, a pretrained image classification network. These intermediate layers are necessary to define the representation of content and style from our images. For an input image, we will try to match the corresponding style and content target representations at these intermediate layers.
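
A sketch of pulling intermediate-layer activations from the pretrained VGG19 in Keras (the specific layer names below are commonly used choices for content and style, not the only possible ones):

```python
import tensorflow as tf

# Pretrained VGG19 without its classification head; we only read activations, so freeze it.
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
vgg.trainable = False

content_layers = ["block5_conv2"]
style_layers = ["block1_conv1", "block2_conv1", "block3_conv1",
                "block4_conv1", "block5_conv1"]

outputs = [vgg.get_layer(name).output for name in content_layers + style_layers]
feature_extractor = tf.keras.Model(inputs=vgg.input, outputs=outputs)
```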

Why Intermediate Layers?

You may be wondering why these intermediate outputs within our pretrained image classification network allow us to define style and content representations. At a high level, this phenomenon can be explained by the fact that in order for a network to perform image classification (which our network has been trained to do), it must understand the image. This involves taking the raw image as input pixels and building an internal representation through transformations that turn the raw image pixels into a complex understanding of the features present within the image. This is also partly why convolutional neural networks are able to generalize well: they’re able to capture the invariances and defining features within classes (e.g., cats vs. dogs) that are agnostic to background noise and other nuisances. Thus, somewhere between where the raw image is fed in and the classification label is output, the model serves as a complex feature extractor; hence by accessing intermediate layers, we’re able to describe the content and style of input images.

Content Cost Function

Our content loss definition is actually quite simple. We pass the network both the desired content image and our base input image, which returns the intermediate layer outputs (from the layers defined above). Then we simply take the Euclidean distance between the two intermediate representations of those images.
Say we use the features extracted in the intermediate layer l:

$$L_{content}(C, G) = \sum \left( a^{[l](C)} - a^{[l](G)} \right)^{2}$$
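
A minimal sketch of this content cost, assuming `a_C` and `a_G` are the layer-l activations of the content and generated images (for example, taken from the `feature_extractor` sketched above):

```python
import tensorflow as tf

def content_cost(a_C, a_G):
    """Sum of squared differences between layer-l activations of C and G."""
    return tf.reduce_sum(tf.square(a_C - a_G))
```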

Style Cost Function

Computing style loss is a bit more involved, but follows the same principle, this time feeding our network the base input image and the style image. However, instead of comparing the raw intermediate outputs of the base input image and the style image, we instead compare the Gram matrices of the two outputs.

Say we are using layer l’s activations to measure “style”. We define style as the correlation between activations across channels, represented mathematically by a style matrix.

  • Style matrix of the style image (Gram matrix):
    $$G_{k, k'}^{[l](S)} = \sum_{i=1}^{n_{H}^{[l]}} \sum_{j=1}^{n_{W}^{[l]}} a_{i, j, k}^{[l](S)} a_{i, j, k'}^{[l](S)}$$
    $k, k'$ refer to the k-th and k'-th channels respectively, so $G^{[l](S)}$ has shape $n_{c}^{[l]} \times n_{c}^{[l]}$.

  • Style matrix of the generated image (Gram matrix):
    $$G_{k, k'}^{[l](G)} = \sum_{i=1}^{n_{H}^{[l]}} \sum_{j=1}^{n_{W}^{[l]}} a_{i, j, k}^{[l](G)} a_{i, j, k'}^{[l](G)}$$
    $k, k'$ refer to the k-th and k'-th channels respectively, so $G^{[l](G)}$ has shape $n_{c}^{[l]} \times n_{c}^{[l]}$.

Then, the style cost function is:

$$L_{style}(S, G) = \frac{1}{\left( 2 n_{H}^{[l]} n_{W}^{[l]} n_{C}^{[l]} \right)^{2}} \left\| G^{[l](S)} - G^{[l](G)} \right\|_{F}^{2}$$
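
A sketch of the Gram matrix and per-layer style cost, assuming activations of shape (1, n_H, n_W, n_C) as produced by the VGG19 extractor above:

```python
import tensorflow as tf

def gram_matrix(a):
    """G[k, k'] = sum over all spatial positions (i, j) of a[i, j, k] * a[i, j, k']."""
    a = tf.squeeze(a, axis=0)                       # (n_H, n_W, n_C)
    flat = tf.reshape(a, (-1, a.shape[-1]))         # (n_H * n_W, n_C)
    return tf.matmul(flat, flat, transpose_a=True)  # (n_C, n_C)

def style_cost(a_S, a_G):
    """Squared Frobenius distance between Gram matrices, with the 1/(2 n_H n_W n_C)^2 normalization."""
    _, n_H, n_W, n_C = a_S.shape
    norm = (2.0 * n_H * n_W * n_C) ** 2
    return tf.reduce_sum(tf.square(gram_matrix(a_S) - gram_matrix(a_G))) / norm
```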

Finally, we obtain the overall loss of the network by combining the two costs with weights $\alpha$ and $\beta$:

$$L(G) = \alpha L_{content}(C, G) + \beta L_{style}(S, G)$$

In this case, we can use the Adam optimizer to minimize the loss. We iteratively update the output image so that it minimizes the loss: we do not update the weights of the network; instead, we train the input image itself.
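
A sketch of that optimization loop, assuming `content_image` and `style_image` are float tensors of shape (1, H, W, 3) with pixel values in [0, 255], and reusing the `feature_extractor`, `content_cost`, and `style_cost` sketched above (the learning rate, weights, and step count are illustrative; a real pipeline would also apply the VGG preprocessing):

```python
import tensorflow as tf

generated = tf.Variable(content_image, dtype=tf.float32)   # initialize G with the content image
optimizer = tf.keras.optimizers.Adam(learning_rate=0.02)
alpha, beta = 1e4, 1e-2                                     # content / style weights (illustrative)

# Content and style targets are fixed, so compute them once.
a_C = feature_extractor(content_image)
a_S = feature_extractor(style_image)

@tf.function
def train_step():
    with tf.GradientTape() as tape:
        a_G = feature_extractor(generated)
        # first output is the content layer, the rest are style layers
        loss = alpha * content_cost(a_C[0], a_G[0])
        loss += beta * tf.add_n([style_cost(s, g) for s, g in zip(a_S[1:], a_G[1:])])
    grads = tape.gradient(loss, generated)
    optimizer.apply_gradients([(grads, generated)])            # update the image, not the network
    generated.assign(tf.clip_by_value(generated, 0.0, 255.0))  # keep pixels in a valid range
    return loss

for step in range(1000):
    train_step()
```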


1D and 3D generalizations of models

When we say Convolutional Neural Network (CNN), we generally refer to a 2-dimensional CNN, which is used for image classification. But there are two other types of convolutional neural networks used in practice: 1-dimensional and 3-dimensional CNNs.

2D CNN

This is the standard convolutional neural network, first introduced in the LeNet-5 architecture. Conv2D is generally used on image data. It is called a 2-dimensional CNN because the kernel slides along 2 dimensions of the data, as shown in the following image.

[Figure: a 2D kernel sliding along the height and width of an image]
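
For reference, a minimal Keras Conv2D layer (the 64×64×3 input shape and filter count are arbitrary illustrative values):

```python
import tensorflow as tf

# 2D convolution: a 3x3 kernel slides along the height and width of the image.
model_2d = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),   # (height, width, channels)
    tf.keras.layers.Conv2D(filters=16, kernel_size=(3, 3), activation="relu"),
])
```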

1D CNN

To match the input of the network, we can choose kernels of different dimensionality. For example, if the input is a one-dimensional vector (such as a time series), the corresponding kernel slides along a single dimension.
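
A minimal Conv1D sketch for such data (the sequence length and feature count are illustrative):

```python
import tensorflow as tf

# 1D convolution: the kernel slides along a single axis (e.g. time).
model_1d = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100, 1)),      # 100 time steps, 1 feature per step
    tf.keras.layers.Conv1D(filters=16, kernel_size=5, activation="relu"),
])
```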

3D CNN

In Conv3D, the kernel slides in 3 dimensions, as shown below. Conv3D is mostly used with 3D image data such as Magnetic Resonance Imaging (MRI) data, which is widely used for examining the brain, spinal cord, internal organs, and more. A Computerized Tomography (CT) scan is another example of 3D data, created by combining a series of X-ray images taken from different angles around the body. We can use Conv3D to classify such medical data or extract features from it.

[Figure: a 3D kernel sliding along the height, width, and depth of a volume]

Here the argument input_shape=(128, 128, 128, 3) has 4 dimensions. A 3D image is 4-dimensional data, where the fourth dimension represents the number of colour channels, just as a flat 2D image has 3 dimensions with the third representing colour channels. The argument kernel_size=(3, 3, 3) gives the (height, width, depth) of the kernel, and the kernel's 4th dimension matches the number of colour channels.
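
A Keras sketch matching the shapes mentioned above (the filter count is an illustrative assumption):

```python
import tensorflow as tf

# 3D convolution: a 3x3x3 kernel slides along the height, width, and depth of the volume.
model_3d = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 128, 3)),   # (H, W, depth, colour channels)
    tf.keras.layers.Conv3D(filters=16, kernel_size=(3, 3, 3), activation="relu"),
])
```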

Summary

  • In a 1D CNN, the kernel moves in 1 direction. Its input and output data are 2-dimensional. Mostly used on time-series data.
  • In a 2D CNN, the kernel moves in 2 directions. Its input and output data are 3-dimensional. Mostly used on image data.
  • In a 3D CNN, the kernel moves in 3 directions. Its input and output data are 4-dimensional. Mostly used on 3D image data (MRI, CT scans).