Chapter 13 Convolutional Neural Networks

Reading notes on O'Reilly's Hands-On Machine Learning with Scikit-Learn and TensorFlow.

13.1 The Architecture of the Visual Cortex

Many neurons in the visual cortex have a small local receptive field, meaning they react only to visual stimuli located in a limited region of the visual field. Some neurons have larger receptive fields, and they react to more complex patterns that are combinations of the lower-level patterns. These observations led to the idea that the higher-level neurons are based on the outputs of neighboring lower-level neurons (in Figure 13-1, notice that each neuron is connected only to a few neurons from the previous layer). This powerful architecture is able to detect all sorts of complex patterns in any area of the visual field.

13.2 Convolutional Layer

Convolutional layer: neurons in the first convolutional layer are not connected to every single pixel in the input image, but only to pixels in their receptive fields. In turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer. This architecture allows the network to concentrate on low-level features in the first hidden layer, then assemble them into higher-level features in the next hidden layer, and so on. This hierarchical structure is common in real-world images, which is one of the reasons why CNNs work so well for image recognition.

A convolution is a mathematical operation that slides one function over another and measures the integral of their pointwise multiplication. It has deep connections with the Fourier transform and the Laplace transform, and is heavily used in signal processing. Convolutional layers actually use cross-correlations, which are very similar to convolutions (see http://goo.gl/HAfxXd for more details).
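
As a small illustration (not from the book), the following NumPy snippet contrasts a true 1D convolution with the cross-correlation that convolutional layers actually compute; the signal and kernel values are arbitrary:

import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy 1D input
kernel = np.array([1.0, 0.0, -1.0])           # toy 1D filter

# A true convolution flips the kernel before sliding it over the signal.
print(np.convolve(signal, kernel, mode="valid"))   # [2. 2. 2.]

# Cross-correlation (what conv layers compute) slides the kernel as-is.
print(np.correlate(signal, kernel, mode="valid"))  # [-2. -2. -2.]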

A neuron located in row $i$, column $j$ of a given layer is connected to the outputs of the neurons in the previous layer located in rows $i$ to $i + f_h - 1$, columns $j$ to $j + f_w - 1$, where $f_h$ and $f_w$ are the height and width of the receptive field (see Figure 13-3). In order for a layer to have the same height and width as the previous layer, it is common to add zeros around the inputs, as shown in the diagram. This is called zero padding.

It is also possible to connect a large input layer to a much smaller layer by spacing out the receptive fields, as shown in Figure 13-4. The distance between two consecutive receptive fields is called the stride. A neuron located in row $i$, column $j$ in the upper layer is connected to the outputs of the neurons in the previous layer located in rows $i \times s_h$ to $i \times s_h + f_h - 1$, columns $j \times s_w$ to $j \times s_w + f_w - 1$, where $s_h$ and $s_w$ are the vertical and horizontal strides.

13.2.1 Filters

A neuron's weights, called filters (or convolution kernels), can be represented as a small image the size of the receptive field. A layer full of neurons using the same filter gives you a feature map, which highlights the areas in an image that are most similar to the filter. During training, a CNN finds the most useful filters for its task, and it learns to combine them into more complex patterns (e.g., a cross is an area in an image where both the vertical filter and the horizontal filter are active).

13.2.2 Stacking Multiple Feature Maps

Each convolutional layer is composed of several feature maps of equal sizes, so it is more accurately represented in 3D (see Figure 13-6). Within one feature map, all neurons share the same parameters (weights and bias term), but different feature maps may have different parameters. A neuron's receptive field is the same as described earlier, but it extends across all the previous layer's feature maps. In short, a convolutional layer simultaneously applies multiple filters to its inputs, making it capable of detecting multiple features anywhere in its inputs.

Moreover, input images are also composed of multiple sublayers: one per color channel. There are typically three: red, green, and blue (RGB).

A neuron located in row $i$, column $j$ of feature map $k$ in a given convolutional layer $l$ is connected to the outputs of the neurons in the previous layer $l - 1$, located in rows $i \times s_h$ to $i \times s_h + f_h - 1$ and columns $j \times s_w$ to $j \times s_w + f_w - 1$, across all feature maps (in layer $l - 1$). Note that all neurons located in the same row $i$ and column $j$ but in different feature maps are connected to the outputs of the exact same neurons in the previous layer.

Equation 13-1. Computing the output of a neuron in a convolutional layer
$$
z_{i,j,k} = b_k + \sum_{u=1}^{f_h} \sum_{v=1}^{f_w} \sum_{k'=1}^{f_{n'}} x_{i',j',k'} \cdot w_{u,v,k',k}
\quad \text{with} \quad
\begin{cases}
i' = i \cdot s_h + u + f_h - 1 \\
j' = j \cdot s_w + v + f_w - 1
\end{cases}
$$

  • $z_{i,j,k}$ is the output of the neuron located in row $i$, column $j$ in feature map $k$ of the convolutional layer (layer $l$).
  • $s_h$ and $s_w$ are the vertical and horizontal strides, $f_h$ and $f_w$ are the height and width of the receptive field, and $f_{n'}$ is the number of feature maps in the previous layer (layer $l - 1$).
  • $x_{i',j',k'}$ is the output of the neuron located in layer $l - 1$, row $i'$, column $j'$, feature map $k'$ (or channel $k'$ if the previous layer is the input layer).
  • $b_k$ is the bias term for feature map $k$ (in layer $l$). You can think of it as a knob that tweaks the overall brightness of feature map $k$.
  • $w_{u,v,k',k}$ is the connection weight between any neuron in feature map $k$ of layer $l$ and its input located at row $u$, column $v$ (relative to the neuron's receptive field), and feature map $k'$.
The following snippet expands Equation 13-1 symbolically for one neuron, listing every input activation $x_{i',j',k'}$ and weight $w_{u,v,k',k}$ that contribute to its output:

fh, fw, fn1 = 2, 3, 3   # receptive field height/width, feature maps in layer l-1
sh, sw = 1, 1           # vertical and horizontal strides
i, j, k = 2, 3, 1       # target neuron: row i, column j, feature map k

cal_str = ""
for u in range(1, fh + 1):
    for v in range(1, fw + 1):
        for k1 in range(1, fn1 + 1):
            i1 = i * sh + u + fh - 1   # i' from Equation 13-1
            j1 = j * sw + v + fw - 1   # j' from Equation 13-1
            cal_str += "x({},{},{})*w({},{},{},{})\n".format(i1, j1, k1, u, v, k1, k)
print(cal_str)

13.2.3 TensorFlow Implementation

In TensorFlow, each input image is typically represented as a 3D tensor of shape [height, width, channels]. A mini-batch is represented as a 4D tensor of shape [mini-batch size, height, width, channels]. The weights of a convolutional layer are represented as a 4D tensor of shape $[f_h, f_w, f_{n'}, f_n]$. The bias terms of a convolutional layer are simply represented as a 1D tensor of shape $[f_n]$.

Let’s look at a simple example. The following code loads two sample images, using Scikit-Learn’s load_sample_images() (which loads two color images, one of a Chinese temple, and the other of a flower). Then it creates two 7 × 7 filters (one with a vertical white line in the middle, and the other with a horizontal white line), and applies them to both images using a convolutional layer built using TensorFlow’s conv2d() function (with zero padding and a stride of 2). Finally, it plots one of the resulting feature maps (similar to the top-right image in Figure 13-5).

import tensorflow as tf
import numpy as np
from sklearn.datasets import load_sample_images
import matplotlib.pyplot as plt

tf.reset_default_graph()

# Load the two sample images as a mini-batch of float32 values
dataset = np.array(load_sample_images().images, dtype=np.float32)
batch_size, height, width, channels = dataset.shape

# Create 2 filters: one detects vertical lines, the other horizontal lines
filters_test = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32)
filters_test[:, 3, :, 0] = 1  # vertical line
filters_test[3, :, :, 1] = 1  # horizontal line

# Create a graph with input X plus a convolutional layer applying the 2 filters
X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
convolution = tf.nn.conv2d(X, filters_test, strides=[1, 2, 2, 1], padding="SAME")

with tf.Session() as sess:
    output = sess.run(convolution, feed_dict={X: dataset})

plt.imshow(output[0, :, :, 1])  # plot the 1st image's 2nd feature map
plt.show()

The conv2d() line deserves a bit of explanation:

  • X is the input mini-batch (a 4D tensor, as explained earlier).
  • filters_test is the set of filters to apply (also a 4D tensor, as explained earlier).
  • strides is a four-element 1D array, where the two central elements are the vertical and horizontal strides ($s_h$ and $s_w$). The first and last elements must currently be equal to 1. They may one day be used to specify a batch stride (to skip some instances) and a channel stride (to skip some of the previous layer's feature maps or channels).
  • padding must be either "VALID" or "SAME":
    • If set to "VALID", the convolutional layer does not use zero padding, and may ignore some rows and columns at the bottom and right of the input image,
      depending on the stride.
    • If set to "SAME", the convolutional layer uses zero padding if necessary. In this case, the number of output neurons is equal to the number of input neurons divided by the stride, rounded up (in this example, ceil (13 / 5) = 3). Then zeros are added as evenly as possible around the inputs.

13.2.4 Memory Requirements

If training crashes because of an out-of-memory error, you can try reducing the mini-batch size. Alternatively, you can try reducing dimensionality using a stride, or removing a few layers. Or you can try using 16-bit floats instead of 32-bit floats. Or you could distribute the CNN across multiple devices.
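
As a rough back-of-the-envelope sketch of why convolutional layers are memory-hungry (the layer dimensions below are made up for illustration): a layer with 5 × 5 filters producing 200 feature maps of 150 × 100 on a 3-channel input has few parameters but large activations.

fh, fw, fn_prev, fn = 5, 5, 3, 200   # hypothetical filter size and feature map counts
out_h, out_w = 150, 100              # hypothetical output feature map size
batch_size = 100
bytes_per_float = 4                  # 32-bit floats

params = fn * (fh * fw * fn_prev + 1)   # weights plus one bias per feature map
activations = fn * out_h * out_w        # output values for a single instance

print("parameters:", params)                                     # 15200
print("MB per instance:", activations * bytes_per_float / 1e6)   # 12.0
print("MB per mini-batch:",
      batch_size * activations * bytes_per_float / 1e6)          # 1200.0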

13.3 Pooling Layer

The goal of pooling layers is to subsample (i.e., shrink) the input image in order to reduce the computational load, the memory usage, and the number of parameters (thereby limiting the risk of overfitting). Reducing the input image size also makes the neural network tolerate a little bit of image shift (location invariance).

A pooling layer typically works on every input channel independently, so the output depth is the same as the input depth. You may alternatively pool over the depth dimension, as we will see next, in which case the image’s spatial dimensions (height and width) remain unchanged, but the number of channels is reduced.

The following code creates a max pooling layer with a 2 × 2 kernel, a stride of 2, and no padding, then applies it to the same images (dataset, height, width, and channels are reused from the previous example):

X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
# ksize spans 1 instance, 2 rows, 2 columns, 1 channel, so pooling is spatial only
max_pool = tf.nn.max_pool(X, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")

with tf.Session() as sess:
    output = sess.run(max_pool, feed_dict={X: dataset})

plt.imshow(output[0].astype(np.uint8))  # plot the output for the 1st image
plt.show()

To create an average pooling layer, just use the avg_pool() function instead of max_pool().
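
For example, keeping everything else from the snippet above unchanged:

avg_pool = tf.nn.avg_pool(X, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")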

13.4 CNN Architectures

Typical CNN architectures stack a few convolutional layers (each one generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another pooling layer, and so on. The image gets smaller and smaller as it progresses through the network, but it also typically gets deeper and deeper (i.e., with more feature maps) thanks to the convolutional layers (see Figure 13-9). At the top of the stack, a regular feedforward neural network is added, composed of a few fully connected layers (+ReLUs), and the final layer outputs the prediction (e.g., a softmax layer that outputs estimated class probabilities).
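
A minimal TensorFlow sketch of this pattern (the layer sizes, the 28 × 28 grayscale input, and the 10-class output are arbitrary illustrative choices, not a specific architecture from this chapter):

import tensorflow as tf

X = tf.placeholder(tf.float32, shape=(None, 28, 28, 1))  # e.g., MNIST-sized images

# conv + ReLU, then pool, repeated; the image shrinks while the depth grows
conv1 = tf.layers.conv2d(X, filters=32, kernel_size=3, padding="SAME", activation=tf.nn.relu)
pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2)   # 14 x 14 x 32
conv2 = tf.layers.conv2d(pool1, filters=64, kernel_size=3, padding="SAME", activation=tf.nn.relu)
pool2 = tf.layers.max_pooling2d(conv2, pool_size=2, strides=2)   # 7 x 7 x 64

# regular feedforward network on top
flat = tf.layers.flatten(pool2)
fc1 = tf.layers.dense(flat, 128, activation=tf.nn.relu)
logits = tf.layers.dense(fc1, 10)   # one logit per class
y_proba = tf.nn.softmax(logits)     # estimated class probabilities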

13.4.1 LeNet-5

Table 13-1. LeNet-5 architecture

Layer  Type             Maps  Size     Kernel size  Stride  Activation
Out    Fully Connected  –     10       –            –       RBF
F6     Fully Connected  –     84       –            –       tanh
C5     Convolution      120   1×1      5×5          1       tanh
S4     Avg Pooling      16    5×5      2×2          2       tanh
C3     Convolution      16    10×10    5×5          1       tanh
S2     Avg Pooling      6     14×14    2×2          2       tanh
C1     Convolution      6     28×28    5×5          1       tanh
In     Input            1     32×32    –            –       –
  • MNIST images are $28 \times 28$ pixels, but they are zero-padded to $32 \times 32$ pixels and normalized before being fed to the network. The rest of the network does not use any padding, which is why the size keeps shrinking as the image progresses through the network.
  • The average pooling layers are slightly more complex than usual: each neuron computes the mean of its inputs, then multiplies the result by a learnable coefficient (one per map) and adds a learnable bias term (again, one per map), then finally applies the activation function (see the sketch after this list).
  • Most neurons in C3 maps are connected to neurons in only three or four S2 maps (instead of all six S2 maps).
  • The output layer is a bit special: instead of computing the dot product of the inputs and the weight vector, each neuron outputs the square of the Euclidean distance between its input vector and its weight vector. Each output measures how much the image belongs to a particular digit class. The cross entropy cost function is now preferred, as it penalizes bad predictions much more, producing larger gradients and thus converging faster.
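
A minimal sketch of that scaled average pooling step (an illustration of the idea, not the original LeNet-5 code; the input here stands in for C1's six 28 × 28 feature maps):

import tensorflow as tf

C1 = tf.placeholder(tf.float32, shape=(None, 28, 28, 6))

pooled = tf.nn.avg_pool(C1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")
coef = tf.Variable(tf.ones([6]))    # one learnable coefficient per map
bias = tf.Variable(tf.zeros([6]))   # one learnable bias per map
S2 = tf.tanh(coef * pooled + bias)  # scale, shift, then apply the activation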

13.4.2 AlexNet

Table 13-2. AlexNet architecture

Layer  Type             Maps     Size     Kernel size  Stride  Padding  Activation
Out    Fully Connected  –        1,000    –            –       –        Softmax
F9     Fully Connected  –        4,096    –            –       –        ReLU
F8     Fully Connected  –        4,096    –            –       –        ReLU
C7     Convolution      256      13×13    3×3          1       SAME     ReLU
C6     Convolution      384      13×13    3×3          1       SAME     ReLU
C5     Convolution      384      13×13    3×3          1       SAME     ReLU
S4     Max Pooling      256      13×13    3×3          2       VALID    –
C3     Convolution      256      27×27    5×5          1       SAME     ReLU
S2     Max Pooling      96       27×27    3×3          2       VALID    –
C1     Convolution      96       55×55    11×11        4       SAME     ReLU
In     Input            3 (RGB)  224×224  –            –       –        –

To reduce overfitting, the authors used two regularization techniques we discussed in previous chapters: first, they applied dropout (with a 50% dropout rate) during training to the outputs of layers F8 and F9; second, they performed data augmentation by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions.

AlexNet also uses a competitive normalization step immediately after the ReLU step of layers C1 and C3, called local response normalization (LRN). This form of normalization makes the neurons that most strongly activate inhibit neurons at the same location but in neighboring feature maps (such competitive activation has been observed in biological neurons). This encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features, ultimately improving generalization.

Equation 13-2. Local response normalization
$$
b_i = a_i \left( k + \alpha \sum_{j=j_{\text{low}}}^{j_{\text{high}}} a_j^2 \right)^{-\beta}
\quad \text{with} \quad
\begin{cases}
j_{\text{high}} = \min\left(i + \frac{r}{2},\ f_n - 1\right) \\
j_{\text{low}} = \max\left(0,\ i - \frac{r}{2}\right)
\end{cases}
$$

  • $b_i$ is the normalized output of the neuron located in feature map $i$, at some row $u$ and column $v$ (note that in this equation we consider only neurons located at this row and column, so $u$ and $v$ are not shown).
  • $a_i$ is the activation of that neuron after the ReLU step, but before normalization.
  • $k$, $\alpha$, $\beta$, and $r$ are hyperparameters. $k$ is called the bias, and $r$ is called the depth radius.
  • $f_n$ is the number of feature maps.

In AlexNet, the hyperparameters are set as follows: $r = 2$, $\alpha = 0.00002$, $\beta = 0.75$, and $k = 1$. This step can be implemented using TensorFlow's local_response_normalization() operation.
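
For example, applied to a convolutional layer's output with AlexNet's hyperparameters (the placeholder here just stands in for C1's output):

import tensorflow as tf

conv = tf.placeholder(tf.float32, shape=(None, 55, 55, 96))  # e.g., C1's output
lrn = tf.nn.local_response_normalization(conv, depth_radius=2, bias=1,
                                         alpha=0.00002, beta=0.75)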

A variant of AlexNet called ZF Net is essentially AlexNet with a few tweaked hyperparameters (number of feature maps, kernel size, stride, etc.).

13.4.3 GoogLeNet

The great performance of GoogLeNet came in large part from the fact that the network was much deeper than previous CNNs (see Figure 13-11). This was made possible by sub-networks called inception modules, which allow GoogLeNet to use parameters much more efficiently than previous architectures: GoogLeNet actually has 10 times fewer parameters than AlexNet (roughly 6 million instead of 60 million).

Figure 13-10 shows the architecture of an inception module. The notation “3 × 3 + 2(S)” means that the layer uses a 3 × 3 kernel, stride 2, and SAME padding. The input signal is first copied and fed to four different layers. All convolutional layers use the ReLU activation function. Note that the second set of convolutional layers uses different kernel sizes (1 × 1, 3 × 3, and 5 × 5), allowing them to capture patterns at different scales. Also note that every single layer uses a stride of 1 and SAME padding (even the max pooling layer), so their outputs all have the same height and width as their inputs. This makes it possible to concatenate all the outputs along the depth dimension in the final depth concat layer (i.e., stack the feature maps from all four top convolutional layers). This concatenation layer can be implemented in TensorFlow using the concat() operation, with axis=3 (axis 3 is the depth).

Layers with $1 \times 1$ kernels serve two purposes:

  • First, they are configured to output many fewer feature maps than their inputs, so they serve as bottleneck layers, meaning they reduce dimensionality (of feature maps). This is particularly useful before the 3 × 3 and 5 × 5 convolutions, since these are very computationally expensive layers.
  • Second, each pair of convolutional layers ([1 × 1, 3 × 3] and [1 × 1, 5 × 5]) acts like a single, powerful convolutional layer, capable of capturing more complex patterns. Indeed, instead of sweeping a simple linear classifier across the image (as a single convolutional layer does), this pair of convolutional layers sweeps a two-layer neural network across the image.

In short, you can think of the whole inception module as a convolutional layer on steroids, able to output feature maps that capture complex patterns at various scales.
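
A rough TensorFlow sketch of an inception module (the feature-map counts are illustrative, not the exact numbers GoogLeNet uses in every module):

import tensorflow as tf

X = tf.placeholder(tf.float32, shape=(None, 28, 28, 192))  # example input

# Four parallel branches; every layer uses stride 1 and SAME padding,
# so all outputs keep the input's height and width.
branch1 = tf.layers.conv2d(X, 64, 1, padding="SAME", activation=tf.nn.relu)

branch2 = tf.layers.conv2d(X, 96, 1, padding="SAME", activation=tf.nn.relu)  # 1x1 bottleneck
branch2 = tf.layers.conv2d(branch2, 128, 3, padding="SAME", activation=tf.nn.relu)

branch3 = tf.layers.conv2d(X, 16, 1, padding="SAME", activation=tf.nn.relu)  # 1x1 bottleneck
branch3 = tf.layers.conv2d(branch3, 32, 5, padding="SAME", activation=tf.nn.relu)

branch4 = tf.layers.max_pooling2d(X, pool_size=3, strides=1, padding="SAME")
branch4 = tf.layers.conv2d(branch4, 32, 1, padding="SAME", activation=tf.nn.relu)

# Stack the feature maps of all four branches along the depth dimension.
output = tf.concat([branch1, branch2, branch3, branch4], axis=3)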

(Aside: in news writing and politics, the phrase "on steroids" has become the go-to modifier for any new thing that is bigger and more advanced than a previous version; see https://grammarist.com/usage/on-steroids/.)

The number of feature maps output by each convolutional layer and each pooling layer is shown before the kernel size (in Figure 13-11). The six numbers in the inception modules represent the number of feature maps output by each convolutional layer in the module (in the same order as in Figure 13-10). Note that all the convolutional layers use the ReLU activation function.

Let’s go through this network:

  • The first two layers divide the image’s height and width by 4 (so its area is divided
    by 16), to reduce the computational load.
  • Then the local response normalization layer ensures that the previous layers learn
    a wide variety of features.
  • Two convolutional layers follow, where the first acts like a bottleneck layer. As explained earlier, you can think of this pair as a single smarter convolutional layer.
  • Again, a local response normalization layer ensures that the previous layers capture a wide variety of patterns.
  • Next a max pooling layer reduces the image height and width by 2, again to speed up computations.
  • Then comes the tall stack of nine inception modules, interleaved with a couple max pooling layers to reduce dimensionality and speed up the net.
  • Next, the average pooling layer uses a kernel the size of the feature maps with VALID padding, outputting 1 × 1 feature maps: this surprising strategy is called global average pooling. It effectively forces the previous layers to produce feature maps that are actually confidence maps for each target class (since other kinds of features would be destroyed by the averaging step). This makes it unnecessary to have several fully connected layers at the top of the CNN (like in AlexNet), considerably reducing the number of parameters in the network and limiting the risk of overfitting (a minimal sketch of global average pooling follows this list).
  • The last layers are self-explanatory: dropout for regularization, then a fully connected layer with a softmax activation function to output estimated class probabilities.
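
A minimal sketch of the global average pooling step (the input shape is illustrative; any spatial size works the same way):

import tensorflow as tf

maps = tf.placeholder(tf.float32, shape=(None, 7, 7, 1024))

# Average each feature map over its full spatial extent, leaving one value per map.
global_avg = tf.reduce_mean(maps, axis=[1, 2])   # shape (None, 1024)

# Equivalent pooling-op formulation: a kernel that spans the whole feature map.
global_avg_pool = tf.nn.avg_pool(maps, ksize=[1, 7, 7, 1],
                                 strides=[1, 7, 7, 1], padding="VALID")  # (None, 1, 1, 1024)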

13.4.4 ResNet

The key to being able to train such a deep network with 152 layers is to use skip connections (also called shortcut connections): the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack.

When training a neural network, the goal is to make it model a target function $h(\mathbf{x})$. If you add the input $\mathbf{x}$ to the output of the network (i.e., you add a skip connection), then the network will be forced to model $f(\mathbf{x}) = h(\mathbf{x}) - \mathbf{x}$ rather than $h(\mathbf{x})$. This is called residual learning (see Figure 13-12).
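
A rough sketch of a single residual unit (a simplification: real ResNet units also use batch normalization, and the feature-map count here is arbitrary):

import tensorflow as tf

X = tf.placeholder(tf.float32, shape=(None, 28, 28, 64))

# Two convolutional layers model the residual f(x)...
fx = tf.layers.conv2d(X, 64, 3, padding="SAME", activation=tf.nn.relu)
fx = tf.layers.conv2d(fx, 64, 3, padding="SAME")   # no activation yet

# ...and the skip connection adds the input back, so the unit outputs f(x) + x.
output = tf.nn.relu(fx + X)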

At present, the ResNet architecture is both the most powerful and arguably the simplest, so it is really the one you should probably use for now.

TensorFlow Convolution Operations

  • conv1d() creates a convolutional layer for 1D inputs. This is useful, for example, in natural language processing, where a sentence may be represented as a 1D array of words, and the receptive field covers a few neighboring words.
  • conv3d() creates a convolutional layer for 3D inputs, such as 3D PET (positron-emission tomography) scans.
  • atrous_conv2d() creates an atrous convolutional layer ("à trous" is French for "with holes"). This is equivalent to using a regular convolutional layer with a filter dilated by inserting rows and columns of zeros (i.e., holes). For example, a 1 × 3 filter equal to [[1,2,3]] may be dilated with a dilation rate of 4, resulting in a dilated filter [[1, 0, 0, 0, 2, 0, 0, 0, 3]] (see the short demonstration after this list). This allows the convolutional layer to have a larger receptive field at no computational price and using no extra parameters.
  • conv2d_transpose() creates a transpose convolutional layer, sometimes called a
    deconvolutional layer, which upsamples an image. It does so by inserting zeros
    between the inputs, so you can think of this as a regular convolutional layer using
    a fractional stride. Upsampling is useful, for example, in image segmentation: in a typical CNN, feature maps get smaller and smaller as you progress through the network, so if you want to output an image of the same size as the input, you need an upsampling layer.
  • depthwise_conv2d() creates a depthwise convolutional layer that applies every filter to every individual input channel independently. Thus, if there are $f_n$ filters and $f_{n'}$ input channels, then this will output $f_n \times f_{n'}$ feature maps.
  • separable_conv2d() creates a separable convolutional layer that first acts like a depthwise convolutional layer, then applies a 1 × 1 convolutional layer to the resulting feature maps. This makes it possible to apply filters to arbitrary sets of input channels.
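
As a quick demonstration of the dilation described above (plain NumPy, just to show what the dilated filter looks like; this is not how TensorFlow implements it internally):

import numpy as np

kernel = np.array([[1, 2, 3]])
rate = 4

# Insert (rate - 1) zeros between the filter's columns.
dilated = np.zeros((1, (kernel.shape[1] - 1) * rate + 1), dtype=kernel.dtype)
dilated[:, ::rate] = kernel
print(dilated)   # [[1 0 0 0 2 0 0 0 3]]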