Chapter 13 Convolutional Neural Networks

Reading notes on O'Reilly's Hands-On Machine Learning with Scikit-Learn and TensorFlow.

13.1 The Architecture of the Visual Cortex

Many neurons in the visual cortex have a small local receptive field, meaning they react only to visual stimuli located in a limited region of the visual field. Some neurons have larger receptive fields, and they react to more complex patterns that are combinations of the lower-level patterns. These observations led to the idea that the higher-level neurons are based on the outputs of neighboring lower-level neurons (in Figure 13-1, notice that each neuron is connected only to a few neurons from the previous layer). This powerful architecture is able to detect all sorts of complex patterns in any area of the visual field.

13.2 Convolutional Layer

Convolutional layer: neurons in the first convolutional layer are not connected to every single pixel in the input image, but only to pixels in their receptive fields. In turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer. This architecture allows the network to concentrate on low-level features in the first hidden layer, then assemble them into higher-level features in the next hidden layer, and so on. This hierarchical structure is common in real-world images, which is one of the reasons why CNNs work so well for image recognition.

A convolution is a mathematical operation that slides one function over another and measures the integral of their pointwise multiplication. It has deep connections with the Fourier transform and the Laplace transform, and is heavily used in signal processing. Convolutional layers actually use cross-correlations, which are very similar to convolutions (see http://goo.gl/HAfxXd for more details).
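
As a small illustration (not from the book), the following NumPy snippet contrasts a true 1D convolution with the cross-correlation that convolutional layers actually compute; the signal and kernel values are arbitrary:

import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy 1D input
kernel = np.array([1.0, 0.0, -1.0])           # toy 1D filter

# A true convolution flips the kernel before sliding it over the signal.
print(np.convolve(signal, kernel, mode="valid"))   # [2. 2. 2.]

# Cross-correlation (what conv layers compute) slides the kernel as-is.
print(np.correlate(signal, kernel, mode="valid"))  # [-2. -2. -2.]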

A neuron located in row $i$, column $j$ of a given layer is connected to the outputs of the neurons in the previous layer located in rows $i$ to $i + f_h - 1$, columns $j$ to $j + f_w - 1$, where $f_h$ and $f_w$ are the height and width of the receptive field (see Figure 13-3). In order for a layer to have the same height and width as the previous layer, it is common to add zeros around the inputs, as shown in the diagram. This is called zero padding.

It is also possible to connect a large input layer to a much smaller layer by spacing out the receptive fields, as shown in Figure 13-4. The distance between two consecutive receptive fields is called the stride. A neuron located in row $i$, column $j$ in the upper layer is connected to the outputs of the neurons in the previous layer located in rows $i \times s_h$ to $i \times s_h + f_h - 1$, columns $j \times s_w$ to $j \times s_w + f_w - 1$, where $s_h$ and $s_w$ are the vertical and horizontal strides.

13.2.1 Filters

A neuron's weights, called filters (or convolution kernels), can be represented as a small image the size of the receptive field. A layer full of neurons using the same filter gives you a feature map, which highlights the areas in an image that are most similar to the filter. During training, a CNN finds the most useful filters for its task, and it learns to combine them into more complex patterns (e.g., a cross is an area in an image where both the vertical filter and the horizontal filter are active).

13.2.2 Stacking Multiple Feature Maps

Each convolutional layer is composed of several feature maps of equal sizes, so it is more accurately represented in 3D (see Figure 13-6). Within one feature map, all neurons share the same parameters (weights and bias term), but different feature maps may have different parameters. A neuron's receptive field is the same as described earlier, but it extends across all the previous layer's feature maps. In short, a convolutional layer simultaneously applies multiple filters to its inputs, making it capable of detecting multiple features anywhere in its inputs.

Moreover, input images are also composed of multiple sublayers: one per color channel. There are typically three: red, green, and blue (RGB).

A neuron located in row $i$, column $j$ of feature map $k$ in a given convolutional layer $l$ is connected to the outputs of the neurons in the previous layer $l - 1$, located in rows $i \times s_h$ to $i \times s_h + f_h - 1$ and columns $j \times s_w$ to $j \times s_w + f_w - 1$, across all feature maps (in layer $l - 1$). Note that all neurons located in the same row $i$ and column $j$ but in different feature maps are connected to the outputs of the exact same neurons in the previous layer.

Equation 13-1. Computing the output of a neuron in a convolutional layer
$$
z_{i,j,k} = b_k + \sum_{u=1}^{f_h} \sum_{v=1}^{f_w} \sum_{k'=1}^{f_{n'}} x_{i',j',k'} \cdot w_{u,v,k',k}
\quad \text{with} \quad
\begin{cases}
i' = i \cdot s_h + u + f_h - 1 \\
j' = j \cdot s_w + v + f_w - 1
\end{cases}
$$

  • $z_{i,j,k}$ is the output of the neuron located in row $i$, column $j$ in feature map $k$ of the convolutional layer (layer $l$).
  • $s_h$ and $s_w$ are the vertical and horizontal strides, $f_h$ and $f_w$ are the height and width of the receptive field, and $f_{n'}$ is the number of feature maps in the previous layer (layer $l - 1$).
  • $x_{i',j',k'}$ is the output of the neuron located in layer $l - 1$, row $i'$, column $j'$, feature map $k'$ (or channel $k'$ if the previous layer is the input layer).
  • $b_k$ is the bias term for feature map $k$ (in layer $l$). You can think of it as a knob that tweaks the overall brightness of feature map $k$.
  • $w_{u,v,k',k}$ is the connection weight between any neuron in feature map $k$ of layer $l$ and its input located at row $u$, column $v$ (relative to the neuron's receptive field), and feature map $k'$.
The following snippet expands Equation 13-1 symbolically for one neuron, listing every input activation $x_{i',j',k'}$ and weight $w_{u,v,k',k}$ that contribute to its output:

fh, fw, fn1 = 2, 3, 3   # receptive field height/width, feature maps in layer l-1
sh, sw = 1, 1           # vertical and horizontal strides
i, j, k = 2, 3, 1       # target neuron: row i, column j, feature map k

cal_str = ""
for u in range(1, fh + 1):
    for v in range(1, fw + 1):
        for k1 in range(1, fn1 + 1):
            i1 = i * sh + u + fh - 1   # i' from Equation 13-1
            j1 = j * sw + v + fw - 1   # j' from Equation 13-1
            cal_str += "x({},{},{})*w({},{},{},{})\n".format(i1, j1, k1, u, v, k1, k)
print(cal_str)

13.2.3 TensorFlow Implementation

In TensorFlow, each input image is typically represented as a 3D tensor of shape [height, width, channels]. A mini-batch is represented as a 4D tensor of shape [mini-batch size, height, width, channels]. The weights of a convolutional layer are represented as a 4D tensor of shape $[f_h, f_w, f_{n'}, f_n]$. The bias terms of a convolutional layer are simply represented as a 1D tensor of shape $[f_n]$.

Let’s look at a simple example. The following code loads two sample images, using Scikit-Learn’s load_sample_images() (which loads two color images, one of a Chinese temple, and the other of a flower). Then it creates two 7 × 7 filters (one with a vertical white line in the middle, and the other with a horizontal white line), and applies them to both images using a convolutional layer built using TensorFlow’s conv2d() function (with zero padding and a stride of 2). Finally, it plots one of the resulting feature maps (similar to the top-right image in Figure 13-5).

import tensorflow as tf
import numpy as np
from sklearn.datasets import load_sample_images
import matplotlib.pyplot as plt

tf.reset_default_graph()

# Load the two sample images as a mini-batch of float32 values
dataset = np.array(load_sample_images().images, dtype=np.float32)
batch_size, height, width, channels = dataset.shape

# Create 2 filters: one detects vertical lines, the other horizontal lines
filters_test = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32)
filters_test[:, 3, :, 0] = 1  # vertical line
filters_test[3, :, :, 1] = 1  # horizontal line

# Create a graph with input X plus a convolutional layer applying the 2 filters
X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
convolution = tf.nn.conv2d(X, filters_test, strides=[1, 2, 2, 1], padding="SAME")

with tf.Session() as sess:
    output = sess.run(convolution, feed_dict={X: dataset})

plt.imshow(output[0, :, :, 1])  # plot the 1st image's 2nd feature map
plt.show()

The conv2d() line deserves a bit of explanation:

  • X is the input mini-batch (a 4D tensor, as explained earlier).
  • filters_test is the set of filters to apply (also a 4D tensor, as explained earlier).
  • strides is a four-element 1D array, where the two central elements are the vertical and horizontal strides ($s_h$ and $s_w$). The first and last elements must currently be equal to 1. They may one day be used to specify a batch stride (to skip some instances) and a channel stride (to skip some of the previous layer's feature maps or channels).
  • padding must be either "VALID" or "SAME":
    • If set to "VALID", the convolutional layer does not use zero padding, and may ignore some rows and columns at the bottom and right of the input image,
      depending on the stride.
    • If set to "SAME", the convolutional layer uses zero padding if necessary. In this case, the number of output neurons is equal to the number of input neurons divided by the stride, rounded up (in this example, ceil (13 / 5) = 3). Then zeros are added as evenly as possible around the inputs.

13.2.4 Memory Requirements

If training crashes because of an out-of-memory error, you can try reducing the mini-batch size. Alternatively, you can try reducing dimensionality using a stride, or removing a few layers. Or you can try using 16-bit floats instead of 32-bit floats. Or you could distribute the CNN across multiple devices.
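
As a rough back-of-the-envelope sketch of why convolutional layers are memory-hungry (the layer dimensions below are made up for illustration): a layer with 5 × 5 filters producing 200 feature maps of 150 × 100 on a 3-channel input has few parameters but large activations.

fh, fw, fn_prev, fn = 5, 5, 3, 200   # hypothetical filter size and feature map counts
out_h, out_w = 150, 100              # hypothetical output feature map size
batch_size = 100
bytes_per_float = 4                  # 32-bit floats

params = fn * (fh * fw * fn_prev + 1)   # weights plus one bias per feature map
activations = fn * out_h * out_w        # output values for a single instance

print("parameters:", params)                                     # 15200
print("MB per instance:", activations * bytes_per_float / 1e6)   # 12.0
print("MB per mini-batch:",
      batch_size * activations * bytes_per_float / 1e6)          # 1200.0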

13.3 Pooling Layer

The goal of pooling layers is to subsample (i.e., shrink) the input image in order to reduce the computational load, the memory usage, and the number of parameters (thereby limiting the risk of overfitting). Reducing the input image size also makes the neural network tolerate a little bit of image shift (location invariance).

A pooling layer typically works on every input channel independently, so the output depth is the same as the input depth. You may alternatively pool over the depth dimension, as we will see next, in which case the image’s spatial dimensions (height and width) remain unchanged, but the number of channels is reduced.

The following code creates a max pooling layer with a 2 × 2 kernel, a stride of 2, and no padding, then applies it to the same images (dataset, height, width, and channels are reused from the previous example):

X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
# ksize spans 1 instance, 2 rows, 2 columns, 1 channel, so pooling is spatial only
max_pool = tf.nn.max_pool(X, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")

with tf.Session() as sess:
    output = sess.run(max_pool, feed_dict={X: dataset})

plt.imshow(output[0].astype(np.uint8))  # plot the output for the 1st image
plt.show()

To create an average pooling layer, just use the avg_pool() function instead of max_pool().
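
For example, keeping everything else from the snippet above unchanged:

avg_pool = tf.nn.avg_pool(X, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")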

13.4 CNN Architectures

Typical CNN architectures stack a few convolutional layers (each one generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another pooling layer, and so on. The image gets smaller and smaller as it progresses through the network, but it also typically gets deeper and deeper (i.e., with more feature maps) thanks to the convolutional layers (see Figure 13-9). At the top of the stack, a regular feedforward neural network is added, composed of a few fully connected layers (+ReLUs), and the final layer outputs the prediction (e.g., a softmax layer that outputs estimated class probabilities).
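
A minimal TensorFlow sketch of this pattern (the layer sizes, the 28 × 28 grayscale input, and the 10-class output are arbitrary illustrative choices, not a specific architecture from this chapter):

import tensorflow as tf

X = tf.placeholder(tf.float32, shape=(None, 28, 28, 1))  # e.g., MNIST-sized images

# conv + ReLU, then pool, repeated; the image shrinks while the depth grows
conv1 = tf.layers.conv2d(X, filters=32, kernel_size=3, padding="SAME", activation=tf.nn.relu)
pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2)   # 14 x 14 x 32
conv2 = tf.layers.conv2d(pool1, filters=64, kernel_size=3, padding="SAME", activation=tf.nn.relu)
pool2 = tf.layers.max_pooling2d(conv2, pool_size=2, strides=2)   # 7 x 7 x 64

# regular feedforward network on top
flat = tf.layers.flatten(pool2)
fc1 = tf.layers.dense(flat, 128, activation=tf.nn.relu)
logits = tf.layers.dense(fc1, 10)   # one logit per class
y_proba = tf.nn.softmax(logits)     # estimated class probabilities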

13.4.1 LeNet-5

Table 13-1. LeNet-5 architecture

Layer  Type             Maps  Size     Kernel size  Stride  Activation
Out    Fully Connected  –     10       –            –       RBF
F6     Fully Connected  –     84       –            –       tanh
C5     Convolution      120   1×1      5×5          1       tanh
S4     Avg Pooling      16    5×5      2×2          2       tanh
C3     Convolution      16    10×10    5×5          1       tanh
S2     Avg Pooling      6     14×14    2×2          2       tanh
C1     Convolution      6     28×28    5×5          1       tanh
In     Input            1     32×32    –            –       –
  • MNIST images are $28 \times 28$ pixels, but they are zero-padded to $32 \times 32$ pixels and normalized before being fed to the network. The rest of the network does not use any padding, which is why the size keeps shrinking as the image progresses through the network.
  • The average pooling layers are slightly more complex than usual: each neuron computes the mean of its inputs, then multiplies the result by a learnable coefficient (one per map) and adds a learnable bias term (again, one per map), then finally applies the activation function (see the sketch after this list).
  • Most neurons in C3 maps are connected to neurons in only three or four S2 maps (instead of all six S2 maps).
  • The output layer is a bit special: instead of computing the dot product of the inputs and the weight vector, each neuron outputs the square of the Euclidean distance between its input vector and its weight vector. Each output measures how much the image belongs to a particular digit class. The cross entropy cost function is now preferred, as it penalizes bad predictions much more, producing larger gradients and thus converging faster.
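
A minimal sketch of that scaled average pooling step (an illustration of the idea, not the original LeNet-5 code; the input here stands in for C1's six 28 × 28 feature maps):

import tensorflow as tf

C1 = tf.placeholder(tf.float32, shape=(None, 28, 28, 6))

pooled = tf.nn.avg_pool(C1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")
coef = tf.Variable(tf.ones([6]))    # one learnable coefficient per map
bias = tf.Variable(tf.zeros([6]))   # one learnable bias per map
S2 = tf.tanh(coef * pooled + bias)  # scale, shift, then apply the activation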

13.4.2 AlexNet

Table 13-2. AlexNet architecture

Layer  Type             Maps     Size     Kernel size  Stride  Padding  Activation
Out    Fully Connected  –        1,000    –            –       –        Softmax
F9     Fully Connected  –        4,096    –            –       –        ReLU
F8     Fully Connected  –        4,096    –            –       –        ReLU
C7     Convolution      256      13×13    3×3          1       SAME     ReLU
C6     Convolution      384      13×13    3×3          1       SAME     ReLU
C5     Convolution      384      13×13    3×3          1       SAME     ReLU
S4     Max Pooling      256      13×13    3×3          2       VALID    –
C3     Convolution      256      27×27    5×5          1       SAME     ReLU
S2     Max Pooling      96       27×27    3×3          2       VALID    –
C1     Convolution      96       55×55    11×11        4       SAME     ReLU
In     Input            3 (RGB)  224×224  –            –       –        –

To reduce overfitting, the authors used two regularization techniques we discussed in previous chapters: first, they applied dropout (with a 50% dropout rate) during training to the outputs of layers F8 and F9; second, they performed data augmentation by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions.

AlexNet also uses a competitive normalization step immediately after the ReLU step of layers C1 and C3, called local response normalization (LRN). This form of normalization makes the neurons that most strongly activate inhibit neurons at the same location but in neighboring feature maps (such competitive activation has been observed in biological neurons). This encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features, ultimately improving generalization.

Equation 13-2. Local response normalization
$$
b_i = a_i \left( k + \alpha \sum_{j=j_{\text{low}}}^{j_{\text{high}}} a_j^2 \right)^{-\beta}
\quad \text{with} \quad
\begin{cases}
j_{\text{high}} = \min\left(i + \frac{r}{2},\ f_n - 1\right) \\
j_{\text{low}} = \max\left(0,\ i - \frac{r}{2}\right)
\end{cases}
$$

  • $b_i$ is the normalized output of the neuron located in feature map $i$, at some row $u$ and column $v$ (note that in this equation we consider only neurons located at this row and column, so $u$ and $v$ are not shown).
  • $a_i$ is the activation of that neuron after the ReLU step, but before normalization.
  • $k$, $\alpha$, $\beta$, and $r$ are hyperparameters. $k$ is called the bias, and $r$ is called the depth radius.
  • $f_n$ is the number of feature maps.

In AlexNet, the hyperparameters are set as follows: $r = 2$, $\alpha = 0.00002$, $\beta = 0.75$, and $k = 1$. This step can be implemented using TensorFlow's local_response_normalization() operation.
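
For example, applied to a convolutional layer's output with AlexNet's hyperparameters (the placeholder here just stands in for C1's output):

import tensorflow as tf

conv = tf.placeholder(tf.float32, shape=(None, 55, 55, 96))  # e.g., C1's output
lrn = tf.nn.local_response_normalization(conv, depth_radius=2, bias=1,
                                         alpha=0.00002, beta=0.75)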

A variant of AlexNet called ZF Net is essentially AlexNet with a few tweaked hyperparameters (number of feature maps, kernel size, stride, etc.).

13.4.3 GoogLeNet

The great performance of GoogLeNet came in large part from the fact that the network was much deeper than previous CNNs (see Figure 13-11). This was made possible by sub-networks called inception modules, which allow GoogLeNet to use parameters much more efficiently than previous architectures: GoogLeNet actually has 10 times fewer parameters than AlexNet (roughly 6 million instead of 60 million).

Figure 13-10 shows the architecture of an inception module. The notation “3 × 3 + 2(S)” means that the layer uses a 3 × 3 kernel, stride 2, and SAME padding. The input signal is first copied and fed to four different layers. All convolutional layers use the ReLU activation function. Note that the second set of convolutional layers uses different kernel sizes (1 × 1, 3 × 3, and 5 × 5), allowing them to capture patterns at different scales. Also note that every single layer uses a stride of 1 and SAME padding (even the max pooling layer), so their outputs all have the same height and width as their inputs. This makes it possible to concatenate all the outputs along the depth dimension in the final depth concat layer (i.e., stack the feature maps from all four top convolutional layers). This concatenation layer can be implemented in TensorFlow using the concat() operation, with axis=3 (axis 3 is the depth).

Layers with $1 \times 1$ kernels serve two purposes:

  • First, they are configured to output many fewer feature maps than their inputs, so they serve as bottleneck layers, meaning they reduce dimensionality (of feature maps). This is particularly useful before the 3 × 3 and 5 × 5 convolutions, since these are very computationally expensive layers.
  • Second, each pair of convolutional layers ([1 × 1, 3 × 3] and [1 × 1, 5 × 5]) acts like a single, powerful convolutional layer, capable of capturing more complex patterns. Indeed, instead of sweeping a simple linear classifier across the image (as a single convolutional layer does), this pair of convolutional layers sweeps a two-layer neural network across the image.

In short, you can think of the whole inception module as a convolutional layer on steroids, able to output feature maps that capture complex patterns at various scales.
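
A rough TensorFlow sketch of an inception module (the feature-map counts are illustrative, not the exact numbers GoogLeNet uses in every module):

import tensorflow as tf

X = tf.placeholder(tf.float32, shape=(None, 28, 28, 192))  # example input

# Four parallel branches; every layer uses stride 1 and SAME padding,
# so all outputs keep the input's height and width.
branch1 = tf.layers.conv2d(X, 64, 1, padding="SAME", activation=tf.nn.relu)

branch2 = tf.layers.conv2d(X, 96, 1, padding="SAME", activation=tf.nn.relu)  # 1x1 bottleneck
branch2 = tf.layers.conv2d(branch2, 128, 3, padding="SAME", activation=tf.nn.relu)

branch3 = tf.layers.conv2d(X, 16, 1, padding="SAME", activation=tf.nn.relu)  # 1x1 bottleneck
branch3 = tf.layers.conv2d(branch3, 32, 5, padding="SAME", activation=tf.nn.relu)

branch4 = tf.layers.max_pooling2d(X, pool_size=3, strides=1, padding="SAME")
branch4 = tf.layers.conv2d(branch4, 32, 1, padding="SAME", activation=tf.nn.relu)

# Stack the feature maps of all four branches along the depth dimension.
output = tf.concat([branch1, branch2, branch3, branch4], axis=3)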

(Aside: in news writing and politics, the phrase "on steroids" has become the go-to modifier for any new thing that is bigger and more advanced than a previous version; see https://grammarist.com/usage/on-steroids/.)

The number of feature maps output by each convolutional layer and each pooling layer is shown before the kernel size (in Figure 13-11). The six numbers in the inception modules represent the number of feature maps output by each convolutional layer in the module (in the same order as in Figure 13-10). Note that all the convolutional layers use the ReLU activation function.

Let’s go through this network:

  • The first two layers divide the image’s height and width by 4 (so its area is divided
    by 16), to reduce the computational load.
  • Then the local response normalization layer ensures that the previous layers learn
    a wide variety of features.
  • Two convolutional layers follow, where the first acts like a bottleneck layer. As explained earlier, you can think of this pair as a single smarter convolutional layer.
  • Again, a local response normalization layer ensures that the previous layers capture a wide variety of patterns.
  • Next a max pooling layer reduces the image height and width by 2, again to speed up computations.
  • Then comes the tall stack of nine inception modules, interleaved with a couple max pooling layers to reduce dimensionality and speed up the net.
  • Next, the average pooling layer uses a kernel the size of the feature maps with VALID padding, outputting 1 × 1 feature maps: this surprising strategy is called global average pooling. It effectively forces the previous layers to produce feature maps that are actually confidence maps for each target class (since other kinds of features would be destroyed by the averaging step). This makes it unnecessary to have several fully connected layers at the top of the CNN (like in AlexNet), considerably reducing the number of parameters in the network and limiting the risk of overfitting (a minimal sketch of global average pooling follows this list).
  • The last layers are self-explanatory: dropout for regularization, then a fully connected layer with a softmax activation function to output estimated class probabilities.
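
A minimal sketch of the global average pooling step (the input shape is illustrative; any spatial size works the same way):

import tensorflow as tf

maps = tf.placeholder(tf.float32, shape=(None, 7, 7, 1024))

# Average each feature map over its full spatial extent, leaving one value per map.
global_avg = tf.reduce_mean(maps, axis=[1, 2])   # shape (None, 1024)

# Equivalent pooling-op formulation: a kernel that spans the whole feature map.
global_avg_pool = tf.nn.avg_pool(maps, ksize=[1, 7, 7, 1],
                                 strides=[1, 7, 7, 1], padding="VALID")  # (None, 1, 1, 1024)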

13.4.4 ResNet

The key to being able to train such a deep network with 152 layers is to use skip connections (also called shortcut connections): the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack.

When training a neural network, the goal is to make it model a target function $h(\mathbf{x})$. If you add the input $\mathbf{x}$ to the output of the network (i.e., you add a skip connection), then the network will be forced to model $f(\mathbf{x}) = h(\mathbf{x}) - \mathbf{x}$ rather than $h(\mathbf{x})$. This is called residual learning (see Figure 13-12).
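
A rough sketch of a single residual unit (a simplification: real ResNet units also use batch normalization, and the feature-map count here is arbitrary):

import tensorflow as tf

X = tf.placeholder(tf.float32, shape=(None, 28, 28, 64))

# Two convolutional layers model the residual f(x)...
fx = tf.layers.conv2d(X, 64, 3, padding="SAME", activation=tf.nn.relu)
fx = tf.layers.conv2d(fx, 64, 3, padding="SAME")   # no activation yet

# ...and the skip connection adds the input back, so the unit outputs f(x) + x.
output = tf.nn.relu(fx + X)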

At present, the ResNet architecture is both the most powerful and arguably the simplest, so it is really the one you should probably use for now.

TensorFlow Convolution Operations

  • conv1d() creates a convolutional layer for 1D inputs. This is useful, for example, in natural language processing, where a sentence may be represented as a 1D array of words, and the receptive field covers a few neighboring words.
  • conv3d() creates a convolutional layer for 3D inputs, such as 3D PET (positron-emission tomography) scans.
  • atrous_conv2d() creates an atrous convolutional layer ("à trous" is French for "with holes"). This is equivalent to using a regular convolutional layer with a filter dilated by inserting rows and columns of zeros (i.e., holes). For example, a 1 × 3 filter equal to [[1,2,3]] may be dilated with a dilation rate of 4, resulting in a dilated filter [[1, 0, 0, 0, 2, 0, 0, 0, 3]] (see the short demonstration after this list). This allows the convolutional layer to have a larger receptive field at no computational price and using no extra parameters.
  • conv2d_transpose() creates a transpose convolutional layer, sometimes called a
    deconvolutional layer, which upsamples an image. It does so by inserting zeros
    between the inputs, so you can think of this as a regular convolutional layer using
    a fractional stride. Upsampling is useful, for example, in image segmentation: in a typical CNN, feature maps get smaller and smaller as you progress through the network, so if you want to output an image of the same size as the input, you need an upsampling layer.
  • depthwise_conv2d() creates a depthwise convolutional layer that applies every filter to every individual input channel independently. Thus, if there are $f_n$ filters and $f_{n'}$ input channels, then this will output $f_n \times f_{n'}$ feature maps.
  • separable_conv2d() creates a separable convolutional layer that first acts like a depthwise convolutional layer, then applies a 1 × 1 convolutional layer to the resulting feature maps. This makes it possible to apply filters to arbitrary sets of input channels.
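
As a quick demonstration of the dilation described above (plain NumPy, just to show what the dilated filter looks like; this is not how TensorFlow implements it internally):

import numpy as np

kernel = np.array([[1, 2, 3]])
rate = 4

# Insert (rate - 1) zeros between the filter's columns.
dilated = np.zeros((1, (kernel.shape[1] - 1) * rate + 1), dtype=kernel.dtype)
dilated[:, ::rate] = kernel
print(dilated)   # [[1 0 0 0 2 0 0 0 3]]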