Applications (2)


Computer Vision

  • Computer vision has traditionally been one of the most active research areas for deep learning applications, because vision is a task that is effortless for humans and many animals but challenging for computers (Ballard et al., 1983). Many of the most popular standard benchmark tasks for deep learning algorithms are forms of object recognition or optical character recognition.
  • Computer vision is a very broad field encompassing a wide variety of ways of processing images, and an amazing diversity of applications. Applications of computer vision range from reproducing human visual abilities, such as recognizing faces, to creating entirely new categories of visual abilities. As an example of the latter category, one recent computer vision application is to recognize sound waves from the vibrations they induce in objects visible in a video (Davis et al., 2014). Most deep learning research on computer vision has not focused on such exotic applications that expand the realm of what is possible with imagery, but rather on a small core of AI goals aimed at replicating human abilities. Most deep learning for computer vision is used for object recognition or detection of some form, whether this means reporting which object is present in an image, annotating an image with bounding boxes around each object, transcribing a sequence of symbols from an image, or labeling each pixel in an image with the identity of the object it belongs to. Because generative modeling has been a guiding principle of deep learning research, there is also a large body of work on image synthesis using deep models. While image synthesis ex nihilo is usually not considered a computer vision endeavor, models capable of image synthesis are usually useful for image restoration, a computer vision task involving repairing defects in images or removing objects from images.

Preprocessing

  • Many application areas require sophisticated preprocessing because the original input comes in a form that is difficult for many deep learning architectures to represent. Computer vision usually requires relatively little of this kind of preprocessing.

  • The images should be standardized so that their pixels all lie in the same, reasonable range, like $[0,1]$ or $[-1,1]$. Mixing images that lie in $[0,1]$ with images that lie in $[0,255]$ will usually result in failure. Formatting images to have the same scale is the only kind of preprocessing that is strictly necessary. Many computer vision architectures require images of a standard size, so images must be cropped or scaled to fit that size. Even this rescaling is not always strictly necessary. Some convolutional models accept variably-sized inputs and dynamically adjust the size of their pooling regions to keep the output size constant (Waibel et al., 1989). Other convolutional models have variable-sized output that automatically scales in size with the input, such as models that denoise or label each pixel in an image (Hadsell et al., 2007).
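
To make this concrete, here is a minimal sketch of the kind of rescaling described above, assuming 8-bit input images with values in $[0,255]$; the function name and the choice of target range are illustrative rather than taken from any particular system.

```python
import numpy as np

def standardize_image(image, target_range=(-1.0, 1.0)):
    """Rescale a uint8 image with pixels in [0, 255] to a common range.

    Mixing [0, 255] images with [0, 1] images would break training,
    so every image is mapped to the same target range first.
    """
    image = image.astype(np.float32) / 255.0          # now in [0, 1]
    low, high = target_range
    return image * (high - low) + low                 # now in [low, high]

# Example: a random 32x32 RGB image with integer pixel values.
raw = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
scaled = standardize_image(raw)
print(scaled.min(), scaled.max())  # values lie inside [-1, 1]
```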

  • Dataset augmentation may be seen as a way of preprocessing the training set only. Dataset augmentation is an excellent way to reduce the generalization error of most computer vision models.
    A related idea applicable at test time is to show the model many different versions of the same input (for example, the same image cropped at slightly different locations) and have the different instantiations of the model vote to determine the output. This latter idea can be interpreted as an ensemble approach, and helps to reduce generalization error.
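
A minimal sketch of this test-time voting idea might look as follows; the `model` callable, the crop size, and the offsets are hypothetical placeholders standing in for a trained classifier that returns class probabilities.

```python
import numpy as np

def predict_with_crops(model, image, crop_size, offsets):
    """Average a model's predictions over several crops of one image.

    `model` is assumed to map a (crop_size, crop_size, 3) array to a
    vector of class probabilities; averaging the per-crop predictions
    acts like an ensemble vote over slightly different views.
    """
    votes = []
    for dy, dx in offsets:
        crop = image[dy:dy + crop_size, dx:dx + crop_size, :]
        votes.append(model(crop))
    return np.mean(votes, axis=0)

# Usage with a dummy "model" that just returns uniform probabilities.
dummy_model = lambda crop: np.full(10, 0.1)
image = np.random.rand(36, 36, 3)
probs = predict_with_crops(dummy_model, image, crop_size=32,
                           offsets=[(0, 0), (0, 4), (4, 0), (4, 4)])
print(probs.argmax())
```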

  • Other kinds of preprocessing are applied to both the train and the test set with the goal of putting each example into a more canonical form in order to reduce the amount of variation that the model needs to account for. Reducing the amount of variation in the data can both reduce generalization error and reduce the size of the model needed to fit the training set. Simpler tasks may be solved by smaller models, and simpler solutions are more likely to generalize well. Preprocessing of this kind is usually designed to remove some kind of variability in the input data that is easy for a human designer to describe and that the human designer is confident has no relevance to the task.

  • When training with large datasets and large models, this kind of preprocessing is often unnecessary, and it is best to just let the model learn which kinds of variability it should become invariant to. For example, the AlexNet system for classifying ImageNet has only one preprocessing step: subtracting the mean across training examples of each pixel (Krizhevsky et al., 2012).
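
That per-pixel mean subtraction could be implemented roughly as below; the array shapes and names are illustrative, and this is not the actual AlexNet code.

```python
import numpy as np

def subtract_pixel_mean(train_images, test_images):
    """Subtract the per-pixel mean computed on the training set only.

    Each pixel location (and channel) gets its own mean, estimated
    across training examples, and the same mean image is reused for
    the test set so both splits see an identical transformation.
    """
    mean_image = train_images.mean(axis=0)        # shape: (H, W, C)
    return train_images - mean_image, test_images - mean_image

train = np.random.rand(1000, 32, 32, 3).astype(np.float32)
test = np.random.rand(200, 32, 32, 3).astype(np.float32)
train_centered, test_centered = subtract_pixel_mean(train, test)
print(np.abs(train_centered.mean(axis=0)).max())  # close to 0
```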

Contrast Normalization
  • One of the most obvious sources of variation that can be safely removed for many tasks is the amount of contrast in the image. Contrast simply refers to the magnitude of the difference between the bright and the dark pixels in an image. There are many ways of quantifying the contrast of an image. In the context of deep learning, contrast usually refers to the standard deviation of the pixels in an image or region of an image. Suppose we have an image represented by a tensor $\mathbf{X} \in \mathbb{R}^{r \times c \times 3}$, with $X_{i,j,1}$ being the red intensity at row $i$ and column $j$, $X_{i,j,2}$ giving the green intensity and $X_{i,j,3}$ giving the blue intensity. Then the contrast of the entire image is given by
    $$\sqrt{\frac{1}{3rc} \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{3} \left(X_{i,j,k} - \overline{\mathbf{X}}\right)^{2}}$$
    where $\overline{\mathbf{X}}$ is the mean intensity of the entire image:
    $$\overline{\mathbf{X}} = \frac{1}{3rc} \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{3} X_{i,j,k}$$

  • Global contrast normalization (GCN) aims to prevent images from having varying amounts of contrast by subtracting the mean from each image, then rescaling it so that the standard deviation across its pixels is equal to some constant $s$.

  • This approach is complicated by the fact that no scaling factor can change the contrast of a zero-contrast image (one whose pixels all have equal intensity). Images with very low but non-zero contrast often have little information content. Dividing by the true standard deviation usually accomplishes nothing more than amplifying sensor noise or compression artifacts in such cases. This motivates introducing a small, positive regularization parameter $\lambda$ to bias the estimate of the standard deviation. Alternately, one can constrain the denominator to be at least $\epsilon$. Given an input image $\mathbf{X}$, GCN produces an output image $\mathbf{X}'$, defined such that
    $$X'_{i,j,k} = s \frac{X_{i,j,k} - \overline{\mathbf{X}}}{\max\left\{\epsilon,\ \sqrt{\lambda + \frac{1}{3rc} \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{3} \left(X_{i,j,k} - \overline{\mathbf{X}}\right)^{2}}\right\}}$$

  • Datasets consisting of large images cropped to interesting objects are unlikely to contain any images with nearly constant intensity. In these cases, it is safe to practically ignore the small-denominator problem by setting $\lambda = 0$ and avoid division by zero in extremely rare cases by setting $\epsilon$ to an extremely low value like $10^{-8}$. This is the approach used by Goodfellow et al. (2013a) on the CIFAR-10 dataset. Small images cropped randomly are more likely to have nearly constant intensity, making aggressive regularization more useful. Coates et al. (2011) used $\epsilon = 0$ and $\lambda = 10$ on small, randomly selected patches drawn from CIFAR-10.

  • The scale parameter $s$ can usually be set to 1, as done by Coates et al. (2011), or chosen to make each individual pixel have standard deviation across examples close to 1, as done by Goodfellow et al. (2013a).
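
The GCN formula above translates almost directly into code; the following is a minimal NumPy sketch for a single image, with the default parameters chosen to mirror the large-image settings discussed above.

```python
import numpy as np

def global_contrast_normalize(X, s=1.0, lam=0.0, eps=1e-8):
    """Global contrast normalization of a single image X (r x c x 3).

    Subtract the mean intensity, then rescale so the standard deviation
    of the pixels equals s, with the denominator regularized by lam and
    floored at eps, exactly as in the formula above.
    """
    X = X.astype(np.float64)
    X_mean = X.mean()
    X_centered = X - X_mean
    contrast = np.sqrt(lam + np.mean(X_centered ** 2))
    return s * X_centered / max(contrast, eps)

# Example: the settings used on large images (lam = 0, tiny eps).
image = np.random.rand(32, 32, 3)
normalized = global_contrast_normalize(image, s=1.0, lam=0.0, eps=1e-8)
print(normalized.mean(), normalized.std())  # roughly 0 and s
```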

  • The standard deviation in the equation above is just a rescaling of the $L^2$ norm of the image (assuming the mean of the image has already been removed). It is preferable to define GCN in terms of standard deviation rather than $L^2$ norm because the standard deviation includes division by the number of pixels, so GCN based on standard deviation allows the same $s$ to be used regardless of image size. However, the observation that the $L^2$ norm is proportional to the standard deviation can help build a useful intuition.

  • One can understand GCN as mapping examples to a spherical shell. See figure 12.1 for an illustration. This can be a useful property because neural networks are often better at responding to directions in space rather than exact locations.
    Responding to multiple distances in the same direction requires hidden units with collinear weight vectors but different biases. Such coordination can be difficult for the learning algorithm to discover. Additionally, many shallow graphical models have problems with representing multiple separated modes along the same line.
    GCN avoids these problems by reducing each example to a direction rather than a direction and a distance.

  • Counterintuitively, there is a preprocessing operation known as sphering, and it is not the same operation as GCN. Sphering does not refer to making the data lie on a spherical shell, but rather to rescaling the principal components to have equal variance, so that the multivariate normal distribution used by PCA has spherical contours. Sphering is more commonly known as whitening.

  • Global contrast normalization will often fail to highlight image features we would like to stand out, such as edges and corners. If we have a scene with a large dark area and a large bright area (such as a city square with half the image in the shadow of a building) then global contrast normalization will ensure there is a large difference between the brightness of the dark area and the brightness of the light area. It will not, however, ensure that edges within the dark region stand out.

  • This motivates local contrast normalization. Local contrast normalization ensures that the contrast is normalized across each small window, rather than over the image as a whole. See figure 12.2 for a comparison of global and local contrast normalization.

  • Various definitions of local contrast normalization are possible. In all cases, one modifies each pixel by subtracting a mean of nearby pixels and dividing by a standard deviation of nearby pixels.
    In some cases, this is literally the mean and standard deviation of all pixels in a rectangular window centered on the pixel to be modified (Pinto et al., 2008).
    In other cases, this is a weighted mean and weighted standard deviation using Gaussian weights centered on the pixel to be modified.
    In the case of color images, some strategies process different color channels separately while others combine information from different channels to normalize each pixel (Sermanet et al., 2012).

  • Local contrast normalization can usually be implemented efficiently by using separable convolution (see section 9.8) to compute feature maps of local means and local standard deviations, then using element-wise subtraction and element-wise division on different feature maps.
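
A minimal single-channel sketch of this separable-convolution implementation is given below, using Gaussian weights as described earlier; the window width `sigma` and the regularizer `eps` are illustrative values, not prescribed ones.

```python
import numpy as np
from scipy.ndimage import convolve1d

def local_contrast_normalize(image, sigma=3.0, eps=1e-4):
    """Local contrast normalization of a single-channel image.

    Local means and local standard deviations are computed with a
    Gaussian window applied as two 1-D convolutions (separable
    convolution: once along rows, once along columns); each pixel is
    then centered and divided by its regularized local standard deviation.
    """
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()

    def gaussian_blur(a):
        a = convolve1d(a, kernel, axis=0, mode='reflect')
        return convolve1d(a, kernel, axis=1, mode='reflect')

    local_mean = gaussian_blur(image)
    centered = image - local_mean
    local_std = np.sqrt(gaussian_blur(centered ** 2))
    return centered / np.maximum(local_std, eps)

image = np.random.rand(64, 64)
out = local_contrast_normalize(image)
print(out.shape)
```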

  • Local contrast normalization is a differentiable operation and can also be used as a nonlinearity applied to the hidden layers of a network, as well as a preprocessing operation applied to the input.

  • As with global contrast normalization, we typically need to regularize local contrast normalization to avoid division by zero. In fact, because local contrast normalization typically acts on smaller windows, it is even more important to regularize. Smaller windows are more likely to contain values that are all nearly the same as each other, and thus more likely to have zero standard deviation.

Dataset Augmentation
  • As described in section 7.4, it is easy to improve the generalization of a classifier by increasing the size of the training set by adding extra copies of the training examples that have been modified with transformations that do not change the class. Object recognition is a classification task that is especially amenable to this form of dataset augmentation because the class is invariant to so many transformations and the input can be easily transformed with many geometric operations. As described before, classifiers can benefit from random translations, rotations, and in some cases, flips of the input to augment the dataset. In specialized computer vision applications, more advanced transformations are commonly used for dataset augmentation. These schemes include random perturbation of the colors in an image (Krizhevsky et al., 2012) and nonlinear geometric distortions of the input (LeCun et al., 1998b).
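
A rough sketch of such an augmentation pipeline, combining a random flip, a small random translation, and a mild color perturbation, is shown below; the specific magnitudes are illustrative, not those used in the cited systems.

```python
import numpy as np

def augment(image, rng):
    """Produce a randomly transformed copy of an image for training.

    Applies a random horizontal flip, a small random translation
    (implemented as a random crop of a padded image), and a mild random
    perturbation of the color channels; the class label is unchanged.
    """
    if rng.random() < 0.5:                       # random horizontal flip
        image = image[:, ::-1, :]
    padded = np.pad(image, ((4, 4), (4, 4), (0, 0)), mode='reflect')
    dy, dx = rng.integers(0, 9, size=2)          # random translation of up to +/- 4 pixels
    h, w, _ = image.shape
    image = padded[dy:dy + h, dx:dx + w, :]
    scale = 1.0 + 0.1 * rng.standard_normal(3)   # per-channel color jitter
    return np.clip(image * scale, 0.0, 1.0)

rng = np.random.default_rng(0)
example = np.random.rand(32, 32, 3)
augmented = augment(example, rng)
print(augmented.shape)
```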

Speech Recognition

  • The task of speech recognition is to map an acoustic signal containing a spoken natural language utterance into the corresponding sequence of words intended by the speaker. Let $\boldsymbol{X} = (\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(T)})$ denote the sequence of acoustic input vectors (traditionally produced by splitting the audio into 20 ms frames). Most speech recognition systems preprocess the input using specialized hand-designed features, but some deep learning systems (Jaitly and Hinton, 2011) learn features from raw input. Let $\boldsymbol{y} = (y_1, y_2, \ldots, y_N)$ denote the target output sequence (usually a sequence of words or characters). The automatic speech recognition (ASR) task consists of creating a function $f^{*}_{\mathrm{ASR}}$ that computes the most probable linguistic sequence $\boldsymbol{y}$ given the acoustic sequence $\boldsymbol{X}$:
    $$f^{*}_{\mathrm{ASR}}(\boldsymbol{X}) = \underset{\boldsymbol{y}}{\arg\max}\; P^{*}(\mathbf{y} \mid \mathbf{X} = \boldsymbol{X})$$
    where $P^{*}$ is the true conditional distribution relating the inputs $\boldsymbol{X}$ to the targets $\boldsymbol{y}$.
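
As a simplified illustration of how the acoustic sequence $\boldsymbol{X}$ might be produced, the sketch below splits a waveform into non-overlapping 20 ms frames at an assumed 16 kHz sample rate; real systems typically use overlapping windows and then compute spectral features for each frame.

```python
import numpy as np

def frame_audio(waveform, sample_rate=16000, frame_ms=20):
    """Split a 1-D waveform into consecutive non-overlapping frames.

    With a 16 kHz sample rate, a 20 ms frame contains 320 samples; each
    frame would then be turned into one acoustic feature vector x^(t)
    (for example, a vector of spectral coefficients).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(waveform) // frame_len
    return waveform[:n_frames * frame_len].reshape(n_frames, frame_len)

# One second of synthetic audio -> 50 frames of 320 samples each.
audio = np.random.randn(16000)
X = frame_audio(audio)
print(X.shape)  # (50, 320)
```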

  • From the 1980s until about 2009-2012, state-of-the-art speech recognition systems primarily combined hidden Markov models (HMMs) and Gaussian mixture models (GMMs).
    GMMs modeled the association between acoustic features and phonemes (Bahl et al., 1987), while HMMs modeled the sequence of phonemes.

  • The GMM-HMM model family treats acoustic waveforms as being generated by the following process: first an HMM generates a sequence of phonemes and discrete sub-phonemic states (such as the beginning, middle, and end of each phoneme), then a GMM transforms each discrete symbol into a brief segment of audio waveform. Although GMM-HMM systems dominated ASR until recently, speech recognition was actually one of the first areas where neural networks were applied, and numerous ASR systems from the late 1980s and early 1990s used neural nets (Bourlard and Wellekens, 1989; Waibel et al., 1989; Robinson and Fallside, 1991; Bengio et al., 1991, 1992; Konig et al., 1996). At the time, the performance of ASR based on neural nets approximately matched the performance of GMM-HMM systems. For example, Robinson and Fallside (1991) achieved a 26% phoneme error rate on the TIMIT corpus (Garofolo et al., 1993), with 39 phonemes to discriminate between, which was better than or comparable to HMM-based systems. Since then, TIMIT has been a benchmark for phoneme recognition, playing a role similar to the role MNIST plays for object recognition. However, because of the complex engineering involved in software systems for speech recognition and the effort that had been invested in building these systems on the basis of GMM-HMMs, the industry did not see a compelling argument for switching to neural networks. As a consequence, until the late 2000s, both academic and industrial research in using neural nets for speech recognition mostly focused on using neural nets to learn extra features for GMM-HMM systems.

  • Later, with much larger and deeper models and much larger datasets, recognition accuracy was dramatically improved by using neural networks to replace GMMs for the task of associating acoustic features to phonemes (or sub-phonemic states). Starting in 2009, speech researchers applied a form of deep learning based on unsupervised learning to speech recognition. This approach to deep learning was based on training undirected probabilistic models called restricted Boltzmann machines (RBMs) to model the input data. RBMs will be described in part III. To solve speech recognition tasks, unsupervised pretraining was used to build deep feedforward networks whose layers were each initialized by training an RBM. These networks take spectral acoustic representations in a fixed-size input window (around a center frame) and predict the conditional probabilities of HMM states for that center frame. Training such deep networks helped to significantly improve the recognition rate on TIMIT (Mohamed et al., 2009, 2012a), bringing down the phoneme error rate from about 26% to 20.7%. See Mohamed et al. (2012b) for an analysis of reasons for the success of these models. Extensions to the basic phone recognition pipeline included the addition of speaker-adaptive features (Mohamed et al., 2011) that further reduced the error rate. This was quickly followed up by work to expand the architecture from phoneme recognition (which is what TIMIT is focused on) to large-vocabulary speech recognition (Dahl et al., 2012), which involves not just recognizing phonemes but also recognizing sequences of words from a large vocabulary. Deep networks for speech recognition eventually shifted from being based on pretraining and Boltzmann machines to being based on techniques such as rectified linear units and dropout (Zeiler et al., 2013; Dahl et al., 2013). By that time, several of the major speech groups in industry had started exploring deep learning in collaboration with academic researchers. Hinton et al. (2012a) describe the breakthroughs achieved by these collaborators, which are now deployed in products such as mobile phones.

  • Later, as these groups explored larger and larger labeled datasets and incorporated some of the methods for initializing, training, and setting up the architecture of deep nets, they realized that the unsupervised pretraining phase was either unnecessary or did not bring any significant improvement.

  • These breakthroughs in word error rate for speech recognition were unprecedented (around 30% improvement) and followed a long period of about ten years during which error rates did not improve much with the traditional GMM-HMM technology, in spite of the continuously growing size of training sets (see figure 2.4 of Deng and Yu (2014)). This created a rapid shift in the speech recognition community towards deep learning. In a matter of roughly two years, most of the industrial products for speech recognition incorporated deep neural networks, and this success spurred a new wave of research into deep learning algorithms and architectures for ASR, which is still ongoing today.

  • One of these innovations was the use of convolutional networks (Sainath et al., 2013) that replicate weights across time and frequency, improving over the earlier time-delay neural networks that replicated weights only across time. The new two-dimensional convolutional models regard the input spectrogram not as one long vector but as an image, with one axis corresponding to time and the other to frequency of spectral components.

  • Another important push, still ongoing, has been towards end-to-end deep learning speech recognition systems that completely remove the HMM. The first major breakthrough in this direction came from Graves et al. (2013), who trained a deep LSTM RNN (see section 10.10) using MAP inference over the frame-to-phoneme alignment, as in LeCun et al. (1998b) and in the CTC framework (Graves et al., 2006; Graves, 2012). A deep RNN (Graves et al., 2013) has state variables from several layers at each time step, giving the unfolded graph two kinds of depth: ordinary depth due to a stack of layers, and depth due to time unfolding. This work brought the phoneme error rate on TIMIT to a record low of 17.7%. See Pascanu et al. (2014a) and Chung et al. (2014) for other variants of deep RNNs, applied in other settings.

  • Another contemporary step toward end-to-end deep learning ASR is to let the system learn how to “align” the acoustic-level information with the phonetic-level information (Chorowski et al., 2014; Lu et al., 2015).
