# Handwritten Digit Recognition Based on Deep Learning (MATLAB)

# 1. Network design

## 1.1 CNN (feature extraction network + classification network)

With the rapid development of deep learning, its applications have become increasingly widespread, especially in fields such as visual recognition, speech recognition, and natural language processing. As one of the most widely used network models in deep learning, the Convolutional Neural Network (CNN) has attracted growing attention and research. In fact, the CNN is a classic machine learning algorithm that was proposed and studied as early as the 1980s. However, limited hardware computing capacity and a lack of effective training data made it difficult at the time to train a high-performance deep convolutional neural network without over-fitting. A classic application scenario of CNNs in that era, recognizing handwritten digits on bank checks, was nevertheless deployed in practice. With the advance of computer hardware and big data technology, researchers sought new ways to overcome the difficulties of training deep CNNs. Krizhevsky and colleagues proposed a classic CNN architecture that demonstrated the potential of deep structures for feature extraction and achieved a breakthrough in image recognition tasks, fueling a wave of research on deep architectures. As an existing deep structure with proven applications, the convolutional neural network has thus returned to prominence and can be further studied and applied.

The purpose of this article is to gain a thorough, bottom-up understanding of how a CNN is constructed and why, so that when deep learning projects are later carried out with a framework, the underlying construction is clearly understood and can be better exploited.

### 1.1.1 Basic architecture

The basic architecture of a convolutional neural network consists of a feature extractor and a classifier. The feature extractor is usually composed of several convolutional layers and pooling layers: convolution extracts features from the image, while pooling reduces the size of the feature maps and enlarges the receptive field. The feature extractor is followed by a classifier, usually a multi-layer perceptron. In particular, after the last feature-extraction stage, all feature maps are flattened and arranged into a single vector, which serves as the input of the classifier.

### 1.1.2 Convolution layer

The basic convolution operation slides a convolution kernel over the image and, at each position, multiplies the kernel with the corresponding image region and sums the products to obtain one value. By moving the kernel across the whole image and computing these values, the convolution of the entire image is completed. In a convolutional neural network, the convolutional layer involves not only this general image convolution but also the concepts of kernel size and stride. The kernel size is the size of the convolution kernel; different kernel sizes extract image features at different scales. The stride is the number of pixels the kernel moves between successive positions.

In this experiment, 20 filters of size 9×9 are used in the convolutional layer, and the activation function is the ReLU function.
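To make these sizes concrete: a "valid" convolution of a 28×28 MNIST image with a 9×9 kernel yields a 20×20 feature map (28 − 9 + 1 = 20), so 20 filters produce a 20×20×20 output. A minimal sketch (the variable names are illustrative, not part of the project code):

```
% Sketch only: one 28x28 image filtered by 20 random 9x9 kernels
x  = rand(28, 28);       % stand-in for an MNIST image
W1 = randn(9, 9, 20);    % 20 filters, as in this experiment
y  = zeros(20, 20, 20);  % 'valid' output size: 28 - 9 + 1 = 20
for k = 1:20
    % rot90(...,2) cancels conv2's kernel flip, giving plain filtering
    y(:, :, k) = conv2(x, rot90(W1(:, :, k), 2), 'valid');
end
size(y)                  % 20 20 20: one 20x20 feature map per filter
```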

#### 1.1.2.1 Local perception

Human perception of the outside world generally proceeds from the local to the global, and the spatial relations of image pixels likewise show strong local correlation and weak long-range correlation. Therefore, each neuron of a convolutional neural network only needs to perceive a local region of the image; a global view is obtained by combining local information at higher layers. This explains the idea of local connectivity in convolutional neural networks. It resembles the biological visual system, in which neurons of the visual cortex receive information locally, responding only to stimuli in particular regions.

#### 1.1.2.2 Shared parameters

Local connectivity alone is not enough. Suppose each neuron corresponds to 100 parameters and there are 10^6 neurons in total; that still gives 100×10^6 = 10^8 parameters, a very large number. If, however, all 10^6 neurons share the same 100 parameters, the parameter count drops to 100: every neuron performs convolution with the same kernel, which greatly reduces computation. No matter how many neurons the hidden layer contains, the connection between the two layers requires only 100 parameters. This is the significance of parameter sharing.

#### 1.1.2.3 Multi-kernel convolution

If only a single shared 10×10 convolution kernel is used, only one kind of feature can be extracted from the image, which is clearly limiting. More feature types can be obtained by adding convolution kernels, for example choosing 16 different kernels to learn 16 features. Each kernel, convolved with the image, yields one map of features, called a Feature Map; 16 different kernels therefore produce 16 feature maps, which can be regarded as different channels of the image. In this case the convolutional layer contains 10×10×16 = 1600 parameters.

### 1.1.3 Pooling layer

In theory, the features produced by the convolutional layer could be used directly to train a classifier (such as the classic Softmax classifier), but this often leads to a huge amount of computation. Instead, summary statistics are computed by taking the average or maximum of a feature over a local region of the image. Compared with the raw feature maps, these summary statistics both reduce the dimensionality and improve training efficiency. This feature-aggregation operation is called pooling. In this experiment, 2×2 average pooling is adopted.
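For instance, 2×2 average pooling halves both spatial dimensions, so the 20×20 feature maps of this experiment shrink to 10×10. A small numeric sketch (the matrix values are illustrative):

```
A = [ 1  2  3  4;
      5  6  7  8;
      9 10 11 12;
     13 14 15 16];
% each output entry is the mean of one 2x2 block of A
P = (A(1:2:end, 1:2:end) + A(2:2:end, 1:2:end) + ...
     A(1:2:end, 2:2:end) + A(2:2:end, 2:2:end)) / 4;
% P = [3.5 5.5; 11.5 13.5]
```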

### 1.1.4 Feature extraction network

The reshape function flattens the feature maps produced by the feature-extraction network into a 2000×1 column vector. This vector then passes through two hidden layers, containing 95 and 45 neurons respectively, each using the ReLU activation function. Finally, 10 output nodes produce the one-hot-encoded output through the Softmax activation function.

# 2. Training methods

Delta rule + backpropagation (BP) algorithm + cross-entropy cost function + SGD (stochastic gradient descent) + momentum algorithm

## 2.1 Program Implementation

The following functions are needed once the MNIST data set has been downloaded.

Accuracy

```
function acc = accuracy(W1, W2, W3, W4, X_Test, D_Test, epoch)
N = length(D_Test);
d_comp = zeros(1, N);
for k = 1:N
    X  = X_Test(:, :, k);
    V1 = Conv(X, W1);     % custom function (no rotation, direct filtering)
    Y1 = ReLU(V1);
    Y2 = Pool(Y1);        % custom function, 2x2 average pooling
    y1 = reshape(Y2, [], 1);
    v2 = W2*y1;
    y2 = ReLU(v2);
    v3 = W3*y2;
    y3 = ReLU(v3);
    v  = W4*y3;
    y  = Softmax(v);
    [~, i] = max(y);      % index of the largest element of y
    d_comp(k) = i;        % save the digit recognized by the CNN
end
[~, d_true] = max(D_Test);          % decode the one-hot labels
acc = sum(d_comp == d_true) / N;    % fraction of correctly recognized digits
fprintf('round %d: ', epoch);
fprintf('Accuracy is %f\n', acc);   % output the accuracy
end
```

Conv

```
function y = Conv(x, W)
% Convolution (direct filtering) of an image with a bank of kernels.
% conv2/convn flip the kernel, so each kernel is pre-rotated by 180 degrees
% to turn the convolution into plain filtering.
[xrow, xcol, xcha] = size(x);
[wrow, wcol, wpage, numFilters] = size(W);
if xcha > 1 && (xcha == wpage)
    % multi-channel input: one 3-D kernel per filter
    y = zeros(xrow - wrow + 1, xcol - wcol + 1, numFilters);
    W = W(:, :, end:-1:1, :);
    for i = 1:numFilters
        for j = 1:wpage
            W(:, :, j, i) = rot90(W(:, :, j, i), 2);
        end
        y(:, :, i) = convn(x, W(:, :, :, i), 'valid');
    end
else
    % single-channel input: W is wrow x wcol x wpage (one 2-D kernel per page)
    y = zeros(xrow - wrow + 1, xcol - wcol + 1, wpage);
    for k = 1:wpage
        y(:, :, k) = conv2(x, rot90(W(:, :, k), 2), 'valid');
    end
end
end
```

Dropout

```
function ym = Dropout(y, ratio)
% Inverted dropout: returns a mask the same size as y in which a random
% (1 - ratio) fraction of entries equals roughly 1/(1 - ratio) and the
% rest are 0, so that the expected value of y .* ym equals y.
[m, n] = size(y);
ym  = zeros(m, n);
num = round(m*n*(1 - ratio));   % number of units kept
idx = randperm(m*n, num);       % positions of the kept units
ym(idx) = m*n / num;            % scale factor ~ 1/(1 - ratio)
end
```

Pool

```
function y = Pool(x)
% 2x2 average pooling: each output element is the mean of a 2x2 block,
% so both spatial dimensions are halved.
y = (x(1:2:end, 1:2:end, :) + x(2:2:end, 1:2:end, :) + ...
     x(1:2:end, 2:2:end, :) + x(2:2:end, 2:2:end, :)) / 4;
end
```

ReLU

```
function y = ReLU(x)
y = max(0, x);
end
```

Softmax

```
function y = Softmax(x)
ex = exp(x - max(x));   % subtract the max for numerical stability
y  = ex / sum(ex);
end
```

test

```
clc; clear; close all;
tic;
load MNISTData
% Initialize learning rates and weights
alpha = 0.01;                    % learning rate
beta  = 0.01;                    % momentum coefficient
epoch = 20;
W1 = randn(9, 9, 20);
W2 = (2*rand(95, 2000) - 1)/20;
W3 = (2*rand(45, 95) - 1)/10;
W4 = (2*rand(10, 45) - 1)/5;
mmt1 = zeros(size(W1));
mmt2 = zeros(size(W2));
mmt3 = zeros(size(W3));
mmt4 = zeros(size(W4));
dW1  = zeros(size(W1));
for G = 1:epoch
    [xrow, xcol, xcha] = size(X_Train);
    for I = 1:xcha
        %% Convolution/pooling layers
        V1 = Conv(X_Train(:, :, I), W1);
        Y1 = ReLU(V1);
        Y2 = Pool(Y1);
        %% Classification layers
        y1 = reshape(Y2, [], 1);
        v2 = W2*y1;
        y2 = ReLU(v2);
        y2 = y2 .* Dropout(y2, 0.01);
        v3 = W3*y2;
        y3 = ReLU(v3);
        y3 = y3 .* Dropout(y3, 0.01);
        v  = W4*y3;
        y  = Softmax(v);
        e  = D_Train(:, I) - y;
        %% Backpropagation of the error
        delta  = e;              % cross entropy + Softmax
        e3     = W4'*delta;
        delta3 = (v3 > 0).*e3;
        e2     = W3'*delta3;
        delta2 = (v2 > 0).*e2;
        e1     = W2'*delta2;
        E2 = reshape(e1, size(Y2));
        E1 = zeros(size(Y1));  E2_4 = E2/4;
        E1(1:2:end, 1:2:end, :) = E2_4;
        E1(1:2:end, 2:2:end, :) = E2_4;
        E1(2:2:end, 1:2:end, :) = E2_4;
        E1(2:2:end, 2:2:end, :) = E2_4;
        delta1 = (V1 > 0).*E1;
        %% Update the weights (SGD with momentum)
        c = size(W1, 3);
        for t = 1:c
            dW1(:, :, t)  = alpha * conv2(X_Train(:, :, I), rot90(delta1(:, :, t), 2), 'valid');
            mmt1(:, :, t) = dW1(:, :, t) + beta*mmt1(:, :, t);
            W1(:, :, t)   = W1(:, :, t) + mmt1(:, :, t);
            % W1(:,:,t) = W1(:,:,t) + dW1(:,:,t);   % variant without momentum
        end
        dW4  = alpha*delta*y3';
        mmt4 = dW4 + beta*mmt4;
        W4   = W4 + mmt4;
        % W4 = W4 + dW4;
        dW3  = alpha*delta3*y2';
        mmt3 = dW3 + beta*mmt3;
        W3   = W3 + mmt3;
        % W3 = W3 + dW3;
        dW2  = alpha*delta2*y1';
        mmt2 = dW2 + beta*mmt2;
        W2   = W2 + mmt2;
        % W2 = W2 + dW2;
    end
    toc
    %% Accuracy statistics
    acc = accuracy(W1, W2, W3, W4, X_Test, D_Test, G);
end
```

## 2.2 Code interpretation

### 2.2.1 Load the data

```
clc; clear; close all;
tic;
load MNISTData
```

### 2.2.2 Initialize learning rate, weight, number of cycles

```
alpha=0.01;
beta =0.01;
epoch=20;
W1=randn(9,9,20);
W2=(2*rand(95,2000)-1)/20;
W3=(2*rand(45,95)-1)/10;
W4=(2*rand(10,45)-1)/5;
mmt1 = zeros(size(W1));
mmt2 = zeros(size(W2));
mmt3 = zeros(size(W3));
mmt4 = zeros(size(W4));
for G=1:epoch
```

### 2.2.3 Training CNN network

```
[xrow, xcol, xcha] = size(X_Train);
for I= 1:xcha
%%Convolution/Pooling layer
V1= Conv(X_Train(:,:,I),W1);
Y1=ReLU(V1);
Y2=Pool(Y1);
%%Classification layers + dropout
y1=reshape(Y2,[],1);
v2=W2*y1;
y2=ReLU(v2);
y2 = y2 .* Dropout(y2, 0.01);
v3=W3*y2;
y3=ReLU(v3);
y3 = y3 .* Dropout(y3, 0.01);
v=W4*y3;
y=Softmax(v);
e=D_Train(:,I)-y;
```

### 2.2.4 BP algorithm + Delta rule + cross entropy cost function

Backward propagation of the error (cross entropy + Softmax)

```
delta=e;
e3=W4'*delta;
delta3=(v3>0).*e3;
e2=W3'*delta3;
delta2=(v2>0).*e2;
e1=W2'*delta2;
E2=reshape(e1,size(Y2));
E1=zeros(size(Y1));E2_4=E2/4;
E1(1:2:end,1:2:end,:)=E2_4;
E1(1:2:end,2:2:end,:)=E2_4;
E1(2:2:end,1:2:end,:)=E2_4;
E1(2:2:end,2:2:end,:)=E2_4;
delta1=(V1>0).*E1;
```
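In equations, the code above is the delta rule applied layer by layer (the notation below is mine, chosen to match the variables in the code). With a cross-entropy cost and a Softmax output, the output delta reduces to the plain error; each hidden delta is the back-propagated error gated by the ReLU derivative; and the average-pooling layer spreads each error equally over its 2×2 block, hence the division by 4:

```
\begin{aligned}
\delta       &= d - y
              && \text{(cross entropy + Softmax)}\\
\delta^{(3)} &= \varphi'(v_3)\odot\bigl(W_4^{\top}\delta\bigr),
\qquad
\delta^{(2)} = \varphi'(v_2)\odot\bigl(W_3^{\top}\delta^{(3)}\bigr)\\
\varphi'(v)  &= \mathbf{1}[v>0]
              && \text{(ReLU derivative)}\\
\delta^{(1)} &= \varphi'(V_1)\odot E_1,
\qquad
E_1 = \tfrac{1}{4}\,\mathrm{upsample}\bigl(W_2^{\top}\delta^{(2)}\bigr)
\end{aligned}
```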

### 2.2.5 Momentum algorithm + SGD

```
[a,b,c]=size(W1);
for t=1:c
dW1(:,:,t)=alpha* conv2(X_Train(:,:,I),rot90(delta1(:,:,t),2),'valid');
mmt1(:,:,t)= dW1 (:,:,t)+ beta*mmt1(:,:,t);
W1(:,:,t)=W1(:,:,t)+mmt1(:,:,t);
% W1(:,:,t)=W1(:,:,t)+dW1(:,:,t);
end
dW4=alpha*delta*y3';
mmt4 = dW4 + beta*mmt4;
W4 = W4 + mmt4;
% W4 = W4 +dW4;
dW3=alpha*delta3*y2';
mmt3 = dW3 + beta*mmt3;
W3 = W3 + mmt3;
% W3 = W3 +dW3;
dW2=alpha*delta2*y1';
mmt2 = dW2 + beta*mmt2;
W2 = W2 + mmt2;
% W2 = W2 +dW2;
end
toc
```
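Each update in the loop above is one SGD step with momentum. For a generic weight matrix W with delta δ and layer input y (the symbols are mine; in the code, α = `alpha` = 0.01 and β = `beta` = 0.01):

```
\Delta W \;=\; \alpha\,\delta\,y^{\top},
\qquad
m \;\leftarrow\; \Delta W + \beta\,m,
\qquad
W \;\leftarrow\; W + m
```

With β = 0 this reduces to plain SGD, which is what the commented-out `W = W + dW` lines compute.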

### 2.2.6 Assess training effectiveness

Statistics of the accuracy rate:

```
acc=accuracy(W1,W2,W3,W4,X_Test,D_Test,G);
end
```

# 3. The detection results

The figures above show the best results for four rounds of training and for a single round, respectively; the accuracy rate reaches 98.11%.

# 4. Analysis of results

The best result (over 98% accuracy) was obtained through repeated adjustment of the network structure, the algorithm, and the weights. By comparing the weights after training with their initial values and tuning the parameters so that the initial weights end up close in scale to the trained ones, the best training effect was achieved. Among SGD, mini-batch, and full-batch training, SGD was finally chosen after repeatedly comparing running time and accuracy. The momentum algorithm does not improve the accuracy much, but it does improve stability and update speed. In the classification network, moving from one hidden layer to two hidden layers also helps the accuracy of the experimental results. However, suitable values for the learning rate and the momentum coefficient could only be found by experiment after experiment. One round of training was not satisfactory, so several rounds were run; dropout then effectively prevents over-fitting across the multiple rounds. A 2000-1000-100-10 classification network was also tried: although it did not meet the design requirements, training with more nodes gave better results, reaching 98.9% accuracy within at most 10 rounds.

# 5. Conclusion

Convolutional neural networks have made many breakthroughs in image, video, speech, and text processing. In this design, I learned the concepts, algorithms, and optimization methods of convolutional neural networks from the bottom up by searching for, modifying, and understanding the code. I believe this is the only way to understand what deep learning frameworks do under the hood and what each step actually computes. This design also made me familiar with running convolutional neural network code in MATLAB. After this simple implementation, I also came to realize that the topic still needs further research.

First, as CNNs grow deeper and deeper, they place ever higher demands on large-scale, effective data and high-performance computing power. Meanwhile, traditional manual collection of labeled data requires a great deal of manpower and material resources, which increases costs. Unsupervised learning methods for CNNs are therefore becoming more and more important.

Second, to improve CNN training speed, asynchronous SGD algorithms running on CPU and GPU clusters are generally adopted with some success, but they in turn impose requirements on the hardware configuration. Developing efficient and scalable training algorithms therefore remains of practical value. In addition, deep models occupy large amounts of memory for long periods during training, which puts great pressure on the operating environment. How to reduce complexity and train models quickly while preserving accuracy is thus also an important research direction.

Thirdly, the key problem that CNN faces when applying to different tasks is how to choose appropriate training parameters, such as learning rate, convolution kernel size, convolution and pooling layer number, etc., which requires more technical accumulation and experience summary. These training parameters have internal correlation and bring high cost for parameter adjustment. Therefore, the choice of CNN architecture is still worth our in-depth study.