Cat vs. Dog Recognition: Getting Started (1)

Starting from scratch, I finally spent half a month building a cat-vs-dog classifier with 97% accuracy, and I plan to record the journey in a few blog posts. This post covers the basics of neural networks, the principle of BP (back-propagation) neural networks, and the fundamental concepts of convolutional neural networks.
Classifying Dogs and Cats

Classifying dogs and cats was actually the theme of a Kaggle competition in 2013, so we can download pictures that have already been labeled from Kaggle. The dataset contains 25000 pictures in total, 12500 of dogs and 12500 of cats. We can use these pictures to train and test our CNN model.

Basically it is a very classic classification problem, and there are many ways to tackle it. The most popular way is of course the convolutional neural network, widely known as the CNN. The CNN was proposed by Kunihiko Fukushima in 1980. Key concepts such as convolution and pooling can first be found in his 1980 paper "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position", although at that time they were not yet called convolution and pooling. But neural networks had already been proposed many years earlier and can also be used to classify pictures. So what is the strength of the CNN compared with a plain neural network?

Neural Network

Neuron:

Let's first see what a neural network is, since a CNN is also based on the NN. The NN is modeled on the way neurons connect in the human nervous system, which is made up of many neurons. This is a classic neuron:
[Figure: structure of a single neuron — inputs, weights, net input (sum), and activation function]
We have m inputs, and each one has a weight. They are fed into the net input function, which is simply a weighted sum. After this we have the result:
$$v = \sum_{i=1}^{m} w_i x_i$$
Then the result has to go through an activation function. Why do we need the activation function? Because the sum is only a linear function, so no matter how many layers we stack, they could always be collapsed into a single layer, since compositions of linear functions are still linear. We need a nonlinear function to make the output more flexible, because real problems are almost always nonlinear. Two very common activation functions are ReLU and sigmoid.
ReLU:
$$f(x) = \max(0,\, x)$$
Sigmoid:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Sigmoid is much older than ReLU, but in general ReLU is used more often in modern neural networks because it has some benefits. I will discuss this later.
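To make this concrete, here is a minimal numpy sketch of a single neuron (the inputs and weights are made-up numbers, just for illustration): it computes the weighted sum and then applies ReLU or sigmoid.

```python
import numpy as np

def relu(v):
    # ReLU: max(0, v)
    return np.maximum(0.0, v)

def sigmoid(v):
    # Sigmoid: 1 / (1 + e^(-v))
    return 1.0 / (1.0 + np.exp(-v))

# A single neuron with m = 3 inputs (toy numbers).
x = np.array([0.5, -1.2, 2.0])   # inputs x_1..x_m
w = np.array([0.4, 0.3, -0.1])   # one weight per input

v = np.dot(w, x)                 # net input: the weighted sum
print(relu(v), sigmoid(v))       # output after each activation
```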

Neural Network:

Then many neurons are connected together, and we get a neural network like this:
[Figure: many neurons connected into a network]
A classic neural network consists of three parts: an input layer, hidden layers, and an output layer. The structure looks like this:
[Figure: input layer, two hidden layers, and output layer]
So the input is $[x_1, x_2, \ldots, x_p]$, followed by two hidden layers. The first hidden layer consists of n neurons $[b_1^{(1)}, b_2^{(1)}, \ldots, b_n^{(1)}]$ and the second layer consists of q nodes $[b_1^{(2)}, b_2^{(2)}, \ldots, b_q^{(2)}]$. Every node in the first layer connects with every node in the second layer, so the layers are called fully connected. The output is $[y_1, y_2, \ldots, y_q]$, with the same dimension as the second hidden layer.
The total number of weights is $p \cdot n + n \cdot q$. So we can see that if there are too many nodes we get far more weights, and the computation becomes difficult and slow.
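As a small sanity check on this structure, here is a numpy sketch of a forward pass through two fully connected hidden layers; the sizes p, n, q are arbitrary and biases are omitted, so the weight count is exactly p·n + n·q.

```python
import numpy as np

p, n, q = 4, 5, 3                      # input size and hidden layer sizes (arbitrary)
rng = np.random.default_rng(0)

W1 = rng.normal(size=(n, p))           # weights: input -> first hidden layer
W2 = rng.normal(size=(q, n))           # weights: first -> second hidden layer

def relu(v):
    return np.maximum(0.0, v)

x = rng.normal(size=p)                 # an input vector [x_1 ... x_p]
h1 = relu(W1 @ x)                      # first hidden layer [b_1^(1) ... b_n^(1)]
h2 = relu(W2 @ h1)                     # second hidden layer = output [y_1 ... y_q]

print(h2)
print("number of weights:", W1.size + W2.size)   # p*n + n*q
```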

Now we know the classic structure of a neural network. But what is it used for? Actually, it is used to fit functions. Sometimes the function is easy to understand; for example, we may want a function that passes through as many known points in the plane as possible, and we can use polynomial fitting with the least squares method to solve that kind of problem. Sometimes the function is very abstract and seems to have nothing to do with math, just like our goal of classifying dogs and cats. Then we need a neural network, because in theory it can fit any nonlinear function. Let's look at a simple classification problem first: if we want to divide the number axis into two parts, one larger than 0 and the other less than 0, how do we do the segmentation? Easy: the origin is enough. So we actually use something whose dimension is one lower than the object being divided to do the separation: the origin is 0-dimensional and the number axis is a line, so it is 1-dimensional. But what if we want to divide this:
[Figure: two classes of points in the plane that cannot be separated by a straight line]
This is a 2-dimensional problem, so we should use a 1-dimensional object to divide it, namely a straight line; note that a curve is still a 2-dimensional object, so we cannot use a curve. No straight line can separate these two curves, but a neural network has its own way: space transformation. After the space transformation, we have this:
[Figure: the same points after the space transformation, now separable by a straight line]
Now we can divide them with a straight line. Therefore, for classification problems a neural network actually uses a hyperplane, whose dimension is one lower than that of the space, to divide the space into two parts after linear and nonlinear space transformations.

So how do we use a neural network to fit a function? As discussed before, every neuron consists of two math operations: a sum and an activation function. The numbers involved are the inputs and the weights. The inputs are not decided by the network itself, so the only way to adjust a neural network is to adjust its weights so that the output is as close as possible to the desired output. Therefore, training a neural network means training a large number of weights so that the network mimics the real function as closely as possible. These trained weights are called the parameters of the neural network.

BP Neural Network:

Before the BP neural network came out, people did not have a proper way to train the parameters of a neural network. BP stands for back propagation. Training is actually similar to polynomial fitting: for every group of polynomial coefficients we compute the squared error, then adjust the coefficients to make that error as small as possible. A BP neural network works in two stages: forward propagation and back propagation.

Forward propagation has actually been discussed already: using the current inputs and weights to compute the output through the whole network is forward propagation. After forward propagation we have the output produced by the network and the desired output, so there is an error. Then we propagate this error back from the output layer to the input layer, as this image shows:
[Figure: the error is propagated backwards from the output layer to the input layer]
The purpose of propagating the error from the end back to the beginning is to use the error to adjust the weights. Now every node has a corresponding error, so how do we adjust the weights? We use the gradient descent method. In the following image the Y axis is the error and the X axis is a weight. We want to find the best weight so that the error is minimal. We know that if the weight moves along the negative gradient direction, the error decreases fastest. This is the principle used to find the best weight.
[Figure: error as a function of a weight; gradient descent moves the weight toward the minimum]
Let's consider this model to see how the gradient descent method works:
[Figure: a single neuron j with inputs y_i(n), weights w_ji(n), net input v_j(n), and output y_j(n)]
As we have discussed before, v_j (n) can be written as:
$$v_j(n) = \sum_{i} w_{ji}(n)\, y_i(n)$$
So we have:
$$\frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n)$$
And output y_j (n) can be written as:
$$y_j(n) = \varphi\big(v_j(n)\big)$$
So we have:
$$\frac{\partial y_j(n)}{\partial v_j(n)} = \varphi'\big(v_j(n)\big)$$
where φ is the nonlinear activation function, such as ReLU or sigmoid. Then we have a desired output d_j(n), so the error is:
$$e_j(n) = d_j(n) - y_j(n)$$
So we have:
$$\frac{\partial e_j(n)}{\partial y_j(n)} = -1$$
To make the error function continuously differentiable, we define the error energy function:
$$E(n) = \frac{1}{2}\, e_j^2(n)$$
So we have:
$$\frac{\partial E(n)}{\partial e_j(n)} = e_j(n)$$
Now we derive the gradient of E(n) with respect to w_ji, because moving along the negative of this direction decreases E(n) the fastest. According to the chain rule, we have:
$$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \cdot \frac{\partial e_j(n)}{\partial y_j(n)} \cdot \frac{\partial y_j(n)}{\partial v_j(n)} \cdot \frac{\partial v_j(n)}{\partial w_{ji}(n)}$$
After we substitute all of them we have:
$$\frac{\partial E(n)}{\partial w_{ji}(n)} = -\,e_j(n)\, \varphi'\big(v_j(n)\big)\, y_i(n)$$
If we define the correction of w_ji as ∆w_ji in this form:
$$\Delta w_{ji}(n) = -\,\eta\, \frac{\partial E(n)}{\partial w_{ji}(n)} = \eta\, e_j(n)\, \varphi'\big(v_j(n)\big)\, y_i(n)$$
Here η is called the learning rate, which decides how fast the descent is. In the next training step the parameter w_ji becomes w_ji + ∆w_ji. That is how we adjust the parameters of a neural network.
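Putting the derivation together, here is a minimal numpy sketch of one training step for a single sigmoid neuron; the data and learning rate are made up, and the update is exactly ∆w_ji = η·e_j(n)·φ'(v_j(n))·y_i(n).

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_prime(v):
    s = sigmoid(v)
    return s * (1.0 - s)

eta = 0.1                               # learning rate
y_i = np.array([0.2, 0.7, -0.5])        # inputs to neuron j
w_ji = np.array([0.1, -0.3, 0.8])       # current weights
d_j = 1.0                               # desired output

# Forward propagation
v_j = np.dot(w_ji, y_i)                 # net input
y_j = sigmoid(v_j)                      # actual output

# Back propagation for this single neuron
e_j = d_j - y_j                         # error
delta_w = eta * e_j * sigmoid_prime(v_j) * y_i   # ∆w_ji
w_ji = w_ji + delta_w                   # weights for the next training step

print(e_j, delta_w)
```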

As we can see, the gradient depends on the derivative of the activation function φ'(v_j(n)), so it is important to choose a good activation function. In practice we use ReLU most of the time. Why?

(1) The derivative of sigmoid contains a division and an exponential, so it costs much more time to compute than ReLU; ReLU is usually around 6 times faster.
$$\sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^2}$$
(2) For deep networks, the gradient of sigmoid may vanish, so some important information is lost. We can see that when x is very large or very small, the derivative of sigmoid tends to 0.
[Figure: the sigmoid curve and its derivative; the derivative approaches 0 for large |x|]
(3) ReLU is zero over part of its domain, which means at any given time only some of the nodes are active while the others output zero. This is called a sparse network, and it can reduce computational cost.

Of course ReLU has shortcomings: if the learning rate is inappropriate, too many nodes may die, which we do not want to see. So there are improved versions of ReLU such as Leaky ReLU and PReLU.
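Points (1)–(3) are easy to check numerically. The sketch below (plain numpy, with an illustrative Leaky ReLU slope of 0.01) shows that the sigmoid derivative almost vanishes for large |x|, that the ReLU gradient is exactly zero in the dead region, and that Leaky ReLU keeps a small gradient there.

```python
import numpy as np

def sigmoid_prime(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    return (x > 0).astype(float)            # exactly 0 for negative inputs: the "dead" region

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)      # a small slope alpha instead of 0

xs = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_prime(xs))       # ~0 at -10 and 10: the vanishing-gradient problem
print(relu_grad(xs))           # gradient is 0 wherever the input is negative
print(leaky_relu_grad(xs))     # Leaky ReLU keeps a small nonzero gradient there
```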

Convolutional Neural Network

A traditional neural network can indeed be used for image recognition. In fact, a network with two fully connected layers can in theory fit an arbitrary function. However, it does not work well for images, because an image carries too much information and no computer can handle such a huge number of parameters. Then the convolutional neural network, one of the deep learning models, came out and gave the world a human-like way to do image recognition on a computer.

Deep Learning

The real turning point of deep learning came in October 2012. At a workshop in Florence, Italy, Fei-Fei Li, the head of the Stanford AI Lab and the founder of the prominent annual ImageNet computer-vision contest, announced that two of Hinton's students had invented software that identified objects with almost twice the accuracy of the nearest competitor. Then many researchers started to work in this area and showed that deep learning is very useful for image recognition, speech recognition, and other tasks. The popular models AlphaGo and AlphaGo Zero, which defeated the most talented Go players in the world, are both based on deep learning, and the latter also uses reinforcement learning.

So what is deep learning? According to Wikipedia, deep learning is a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from the raw input. Our model, a CNN, is also a kind of deep learning model. The key point of deep learning is the word deep, which means we use multiple different layers to extract features. Lower layers extract low-level features such as corners, lines, and curves. Middle layers extract mid-level features such as eyes, ears, and mouths. Higher layers extract high-level features such as heads, faces, and feet. The highest layers extract features such as a person, a dog, or a chair. This is very similar to the way humans classify things.
[Figure: features learned at different layers, from edges and corners to object parts to whole objects]
But how does a computer extract features? The answer is convolution, a very important operation in today's computer vision.

Convolution

The major difference between a convolutional neural network and a plain neural network is that the former applies convolution operations to the image while the latter does not. So what is convolution?

Convolution is an operation that extracts features from an image.
[Figure: a 3×3 filter sliding over the image; the highlighted 3×3 patch is multiplied element-wise with the filter]
We need a convolutional filter, usually 3×3 or 5×5, with a number on each cell. Then we use this filter to perform the convolution operation on the image. We choose a part of the image with the same size as the filter; for example, if our filter is 3×3, the part of the image we choose must also be 3×3, like the blue square frame in the illustration. Each pixel of an image normally has three values representing the R, G, and B channels, and each value ranges from 0 to 255. For simplicity, in the illustration each pixel value is either 1 or 0. Then we overlap the filter with this square and multiply the corresponding numbers pixel by pixel. So we have the result for this square:

(-1)×0 + (-1)×0 + 1×1 + (-1)×0 + 1×1 + (-1)×0 + 1×1 + (-1)×0 + (-1)×0 = 3

Then we slide the filter to overlap other parts of the image and repeat the operation again and again. Finally we get the convolutional layer (a feature map). We can easily see that only the areas whose pattern is similar to the convolutional filter produce the large value 3, while the others are small or zero. So from the convolutional layer we can easily determine where the slanted edges are.
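Here is a minimal numpy sketch of this sliding-window operation (strictly speaking it is a cross-correlation, which is what deep learning libraries actually compute); the 5×5 binary image and the 3×3 "slant" filter are toy data I made up.

```python
import numpy as np

# Toy 5x5 binary image containing a diagonal (slanted) line of 1s.
image = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
], dtype=float)

# 3x3 filter that responds to a diagonal slant: 1 on the diagonal, -1 elsewhere.
kernel = np.array([
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
], dtype=float)

def conv2d(img, k):
    kh, kw = k.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply the patch with the filter, then sum.
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * k)
    return out

print(conv2d(image, kernel))   # the large value 3 appears only along the diagonal
```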

Local Connectivity

The first advantage of a CNN is local connectivity. Let's look at this classic picture:
[Figure: fully connected vs. locally connected layers for a 1000×1000 image]
If we use the ordinary neural network discussed before to deal with pictures, there is a big problem. As the illustration shows, if we treat every pixel of the image as an input, a 1000×1000 image has 10^6 inputs. Then if we have 10^6 hidden units, we have 10^12 parameters in total. That is impossible to train.

Now let's see how a locally connected network works: every hidden unit connects to only a 10×10 patch of pixels. In this way we have 10^8 parameters, far fewer than in the fully connected case. So why can we use a locally connected network? Because an image has its own property: a pixel is mostly related only to the pixels around it. For example, pixels in the top-left corner have almost nothing to do with pixels in the top-right corner. So connecting every pixel to every hidden unit is meaningless.

Parameter Sharing

Another important feature of a CNN is parameter sharing. Let's see how it reduces the total number of parameters.
[Figure: parameter sharing — the same filter weights are used at every location of the image]
If all 10^6 hidden units in the locally connected network share the same parameters, what happens? We only have 100 parameters! But that means we use the same convolutional filter over every 10×10 area, which is obviously not enough. So we can increase the number of filters to 100, so that we can extract 100 different kinds of features from the image: this filter extracts slants, that one extracts circles, another extracts corners, and so on. This is called parameter sharing: we apply different filters by convolving each one over the whole image, rather than using a different filter for each part of the image. Now we only have 10K parameters, which is really amazing.
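The parameter counts in this section are easy to reproduce; this small Python snippet just redoes the arithmetic, using the 1000×1000 image, 10×10 receptive field, and 100 filters mentioned above.

```python
image_pixels  = 1000 * 1000          # 10^6 inputs for a 1000x1000 image
hidden_units  = 1000 * 1000          # 10^6 hidden units
patch_weights = 10 * 10              # each unit looks at a 10x10 patch
num_filters   = 100                  # number of different shared filters

fully_connected    = image_pixels * hidden_units    # 10^12
locally_connected  = hidden_units * patch_weights   # 10^8
shared_one_filter  = patch_weights                  # 100
shared_100_filters = num_filters * patch_weights    # 10^4 = 10K

print(fully_connected, locally_connected, shared_one_filter, shared_100_filters)
```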

An image may have more than one channel; for example, a color image has three channels for the three fundamental colors R, G, and B. In this case we perform the convolution on each channel separately and then add the per-channel results together to get the final output.
[Figure: convolution over a multi-channel input; the per-channel results are summed into one feature map]
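Here is a minimal numpy sketch of the multi-channel case: each channel is convolved separately and the per-channel results are summed into one feature map (the 8×8 image and the kernel values are random toy data; I give each channel its own 3×3 kernel, which is how most libraries define a filter).

```python
import numpy as np

def conv2d(img, k):
    kh, kw = k.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * k)
    return out

rng = np.random.default_rng(0)
image  = rng.random((3, 8, 8))        # 3-channel (RGB) 8x8 toy image
kernel = rng.random((3, 3, 3))        # one 3x3 kernel per channel

# Convolve each channel separately, then sum the results into one feature map.
feature_map = sum(conv2d(image[c], kernel[c]) for c in range(3))
print(feature_map.shape)              # (6, 6)
```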

Pooling Layer

We usually add a pooling layer after a convolutional layer. Pooling also uses a nonlinear function, and there are many choices, but the most common one is max pooling. It is very simple: we slide a 2×2 window over the feature map, similar to convolution, but instead of multiplying pixel by pixel we simply take the maximum value in the overlapping 2×2 area. As in the illustration, the maximum in the top-left 2×2 block is 6 and the maximum in the top-right 2×2 block is 8.
[Figure: 2×2 max pooling — each 2×2 block of the feature map is replaced by its maximum value]
We certainly lose some information when we use a pooling layer, but that does not matter much: as discussed before, the convolutional result is large only where the area matches the convolutional filter, so the smaller values can be ignored since there is no corresponding feature there. Moreover, a 2×2 window is tiny compared with a 1000×1000 image, so very little information is lost. The benefit is obvious: the pooling layer progressively reduces the spatial size of the representation, which reduces the number of parameters, the memory footprint, and the amount of computation in the network, and hence also helps control overfitting. What is overfitting? It means the model depends too much on the training data and loses its generality; I will discuss this later.
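Here is a minimal numpy sketch of 2×2 max pooling with stride 2 on a toy 4×4 feature map (the numbers are made up).

```python
import numpy as np

def max_pool_2x2(feature_map):
    h, w = feature_map.shape
    # Split into non-overlapping 2x2 blocks and keep the maximum of each block.
    blocks = feature_map[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
], dtype=float)

print(max_pool_2x2(fm))   # [[6. 8.], [3. 4.]]
```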

Fully Connected Layer

After the convolutional and pooling layers we have already extracted features of the image and, at the same time, greatly reduced the number of parameters. Now we use a traditional neural network, that is, fully connected layers, to combine these features so that our model can classify cats and dogs based on the features we have extracted.

Now let’s see what a typical CNN looks like:
[Figure: DeepID network structure — convolution plus max-pooling layers, a 160-dimensional feature layer, and a softmax output]
This is DeepID, invented by Sun Yi at CUHK, which is used to recognize human faces. The input is a one-channel image, i.e. a grayscale image. Three typical convolutional layers, each followed by max pooling, extract features. The face feature is a 160-dimensional vector, and finally a softmax layer classifies the different faces. This is a very typical CNN structure: several convolutional plus max-pooling layers, then a fully connected layer, and finally a softmax layer for classification. My model will also use this kind of structure.
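To close, here is a minimal sketch of such a structure for the cat/dog problem. I write it with tensorflow.keras as an assumption; the layer sizes and the 150×150 input are illustrative choices, not the DeepID settings and not my final model, and I use a sigmoid output instead of softmax since cat vs. dog is a binary problem.

```python
from tensorflow.keras import layers, models

# Convolution + max-pooling blocks extract features, then a fully connected
# layer combines them, and a final layer does the classification.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(150, 150, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(160, activation="relu"),     # feature vector, as in the text
    layers.Dense(1, activation="sigmoid"),    # cat vs. dog (binary classification)
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```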
