Paper Reading Notes (53): Understanding Deep Convolutional Networks

Abstract
Deep convolutional networks provide state-of-the-art classification and regression results over many high-dimensional problems. We review their architecture, which scatters data with a cascade of linear filter weights and non-linearities. A mathematical framework is introduced to analyze their properties. Computations of invariants involve multiscale contractions, the linearization of hierarchical symmetries, and sparse separations. Applications are discussed.

Introduction

Supervised learning is a high-dimensional interpolation problem. We approximate a function f(x) from q training samples {x_i, f(x_i)} for i ≤ q, where x is a data vector of very high dimension d. This dimension is often larger than 10^6, for images or other large signals. Deep convolutional neural networks have recently obtained remarkable experimental results [21]. They give state-of-the-art performance for image classification with thousands of complex classes [19], speech recognition [17], bio-medical applications [22], natural language understanding [30], and many other domains. They are also studied as neuro-physiological models of vision [4].

Multilayer neural networks are computational learning architectures which propagate the input data across a sequence of linear operators and simple non-linearities. The properties of shallow networks, with one hidden layer, are well understood as decompositions in families of ridge functions [10]. However, these approaches do not extend to networks with more layers. Deep convolutional neural networks, introduced by Le Cun [20], are implemented with linear convolutions followed by non-linearities, over typically more than 5 layers. These complex programmable machines, defined by potentially billions of filter weights, bring us to a different mathematical world.

Many researchers have pointed out that deep convolutional networks compute progressively more powerful invariants as depth increases [4, 21], but the relation with network weights and non-linearities is complex. This paper aims at clarifying important principles which govern the properties of such networks, although their architectures and weights may differ across applications. We show that computations of invariants involve multiscale contractions, the linearization of hierarchical symmetries, and sparse separations. This conceptual basis is only a first step towards a full mathematical understanding of convolutional network properties.

In high dimension, x has a considerable number of parameters, which is the curse of dimensionality. Sampling uniformly a volume of dimension d requires a number of samples which grows exponentially with d. In most applications, the number q of training samples instead grows roughly linearly with d. Approximating f(x) from so few samples is possible only if f has strong regularity properties that ultimately allow reducing the dimension of the estimation problem. Any learning algorithm, including deep convolutional networks, thus relies on an underlying assumption of regularity. Specifying the nature of this regularity is one of the core mathematical problems.
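To make the exponential growth concrete, here is a tiny numerical sketch; the grid spacing of 0.1 and the dimensions tried are arbitrary illustrative choices, not values from the paper:

```python
# Number of samples needed to cover [0, 1]^d with a uniform grid of spacing eps.
# The count eps**(-d) grows exponentially with d, while training sets in
# practice grow only roughly linearly with d.
eps = 0.1  # illustrative grid spacing

for d in [1, 2, 10, 100]:
    grid_points = (1 / eps) ** d
    print(f"d = {d:3d}: uniform grid needs ~{grid_points:.2e} samples")
```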

One can try to circumvent the curse of dimensionality by reducing the variability or the dimension of x, without sacrificing the ability to approximate f(x). This is done by defining a new variable Φ(x) where Φ is a contractive operator which reduces the range of variations of x, while still separating different values of f: Φ(x) ≠ Φ(x′) if f(x) ≠ f(x′). This separation-contraction trade-off needs to be adjusted to the properties of f.
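As a toy illustration of this separation-contraction trade-off, the sketch below uses sorting as the contractive operator Φ: sorting is non-expansive, it collapses the orbit of x under coordinate permutations, and it still separates any f that is permutation-invariant. The choice of Φ and of this symmetry group is mine, for illustration only, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy contractive representation: Phi(x) = sorted entries of x.
# Sorting is non-expansive (||Phi(x) - Phi(x')|| <= ||x - x'||) and maps the
# whole permutation orbit of x to one point, so it reduces variability while
# still separating any f that is invariant to coordinate permutations.
def phi(x):
    return np.sort(x)

x = rng.normal(size=64)
x_perm = rng.permutation(x)    # same "class" if f is permutation-invariant
x_other = rng.normal(size=64)  # a genuinely different signal

print("contraction:", np.linalg.norm(phi(x) - phi(x_other)) <= np.linalg.norm(x - x_other))
print("invariance :", np.allclose(phi(x), phi(x_perm)))
print("separation :", np.linalg.norm(phi(x) - phi(x_other)) > 0)
```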

Linearization is a strategy used in machine learning to reduce the dimension with a linear projector. A low-dimensional linear projection of x can separate the values of f if this function remains constant in the direction of a high-dimensional linear space. This is rarely the case, but one can try to find Φ(x) which linearizes high-dimensional domains where f(x) remains constant. The dimension is then reduced by applying a low-dimensional linear projector on Φ(x). Finding such a Φ is the dream of kernel learning algorithms, explained in Section 2.
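A minimal sketch of this idea, using an artificial f(x) = mean(x) that is constant along the (d−1)-dimensional subspace orthogonal to the constant vector (my toy example, not the paper's): a one-dimensional linear projection then separates the values of f exactly, even though x lives in dimension d.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 1000, 5

# Toy f that is constant along a high-dimensional linear space: f(x) = mean(x)
# is unchanged by any perturbation orthogonal to the constant direction.
def f(x):
    return x.mean()

# The 1-dimensional projection <x, p> with p = 1/d therefore separates the
# values of f exactly, despite x having dimension d = 1000.
p = np.ones(d) / d

X = rng.normal(size=(q, d))
for x in X:
    print(f"f(x) = {f(x):+.4f}   projection = {p @ x:+.4f}")
```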

Deep neural networks are more conservative. They progressively contract the space and linearize transformations along which f remains nearly constant, to preserve separation. Such directions are defined by linear operators which belong to groups of local symmetries, introduced in Section 3. To understand the difficulty of linearizing the action of high-dimensional groups of operators, we begin with the groups of translations and diffeomorphisms, which deform signals. They capture essential mathematical properties that are extended to general deep network symmetries in Section 7.
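To see why such groups are hard to handle directly, the sketch below applies a small smooth warp x(u − τ(u)) to a 1-D oscillatory signal; even a mild deformation moves the signal far in Euclidean norm. The signal and displacement field are arbitrary illustrative choices:

```python
import numpy as np

# A 1-D signal and a small smooth deformation x_tau(u) = x(u - tau(u)).
# Even a mild warp produces a large Euclidean distance to the original,
# which is why the action of diffeomorphisms is hard to linearize on x itself.
n = 512
u = np.arange(n)
x = np.cos(2 * np.pi * 12 * u / n)           # illustrative oscillatory signal

tau = 4.0 * np.sin(2 * np.pi * u / n)        # small smooth displacement field
x_warp = np.interp(u - tau, u, x, period=n)  # x(u - tau(u)), periodic boundary

print("relative distortion ||x_warp - x|| / ||x|| =",
      np.linalg.norm(x_warp - x) / np.linalg.norm(x))
```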

To linearize diffeomorphisms and preserve separability, Section 4 shows that we must separate the variations of x at different scales, with a wavelet transform. This is implemented with multiscale filter convolutions, which are building blocks of deep convolution filtering. General deep network architectures are introduced in Section 5. They iterate on linear operators which filter and linearly combine different channels in each network layer, followed by contractive non-linearities.
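A minimal 1-D sketch of such a multiscale filter bank, using dyadic dilations of a Morlet-like complex band-pass filter plus a low-pass average; the filter design and parameters below are illustrative assumptions, not the 2-D filters of Figure 1:

```python
import numpy as np

def morlet_filter(n, xi, sigma):
    """Complex Morlet-like band-pass filter sampled on n points (illustrative)."""
    t = np.arange(-n // 2, n // 2)
    g = np.exp(-t**2 / (2 * sigma**2))
    psi = g * np.exp(1j * xi * t)
    return psi - psi.mean()  # enforce zero mean, i.e. a band-pass filter

def wavelet_transform(x, J=4):
    """Convolve x with dyadic dilations psi_j, j = 0..J-1, plus a low-pass average."""
    n = len(x)
    coeffs = []
    for j in range(J):
        psi_j = morlet_filter(n, xi=np.pi / 2**j, sigma=2.0 * 2**j)
        coeffs.append(np.convolve(x, psi_j, mode="same"))
    low_pass = np.convolve(x, np.ones(2**J) / 2**J, mode="same")
    return coeffs, low_pass

rng = np.random.default_rng(0)
x = rng.normal(size=256)
band_pass, low_pass = wavelet_transform(x, J=4)
print([c.shape for c in band_pass], low_pass.shape)
```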

To understand how non-linear contractions interact with linear operators, Section 6 begins with simpler networks which do not recombine channels in each layer. It defines a non-linear scattering transform, introduced in [24], where wavelets have a separation and linearization role. The resulting contraction, linearization and separability properties are reviewed. We shall see that sparsity is important for separation.
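The sketch below computes zeroth-, first- and second-order scattering coefficients by alternating wavelet convolution, complex modulus and averaging; the filter construction and scales are simplified assumptions rather than the exact construction of [24]:

```python
import numpy as np

def band_pass(n, j):
    """Illustrative complex band-pass filter at dyadic scale 2**j."""
    t = np.arange(-n // 2, n // 2)
    psi = np.exp(-t**2 / (2 * (2.0 * 2**j)**2)) * np.exp(1j * np.pi * t / 2**j)
    return psi - psi.mean()

def scattering(x, J=3):
    """Order 0/1/2 scattering coefficients: averages of x, |x*psi_j|, ||x*psi_j|*psi_k|."""
    n = len(x)
    S = [x.mean()]                                    # order 0: global average
    for j in range(J):
        u1 = np.abs(np.convolve(x, band_pass(n, j), mode="same"))
        S.append(u1.mean())                           # order 1
        for k in range(j + 1, J):                     # frequency-decreasing paths k > j
            u2 = np.abs(np.convolve(u1, band_pass(n, k), mode="same"))
            S.append(u2.mean())                       # order 2
    return np.array(S)

rng = np.random.default_rng(0)
x = rng.normal(size=256)
print("scattering vector:", scattering(x, J=3).round(3))
```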

Section 7 extends these ideas to a more general class of deep convolutional networks. Channel combinations provide the flexibility needed to extend translations to larger groups of local symmetries adapted to f. The network is structured by factorizing groups of symmetries, in which case all linear operators are generalized convolutions. Computations are ultimately performed with filter weights, which are learned. Their relation with groups of symmetries is explained. A major issue is to preserve a separation margin across classification frontiers. Deep convolutional networks have the ability to do so, by separating network fibers which are progressively more invariant and specialized. This can give rise to invariant grandmother type neurons observed in deep networks [1]. The paper studies architectures as opposed to computational learning of network weights, which is an outstanding optimization issue [21].

Figure 1: Wavelet transform of an image x(u), computed with a cascade of convolutions with filters over J = 4 scales and K = 4 orientations. The low-pass and K = 4 band-pass filters are shown on the first arrows.

Figure 2: A convolution network iteratively computes each layer x_j by transforming the previous layer x_{j−1}, with a linear operator W_j and a pointwise non-linearity ρ.
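A minimal NumPy sketch of this iteration, where each W_j filters the input channels and linearly combines them into output channels, and ρ is a pointwise ReLU; the channel sizes and random filter weights below are placeholders rather than learned values:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_layer(x, weights):
    """W_j: filter each input channel and linearly combine them into output channels.

    x: (channels_in, length), weights: (channels_out, channels_in, filter_size).
    """
    c_out, c_in, _ = weights.shape
    out = np.zeros((c_out, x.shape[1]))
    for o in range(c_out):
        for i in range(c_in):
            out[o] += np.convolve(x[i], weights[o, i], mode="same")
    return out

def relu(x):
    """Pointwise contractive non-linearity rho."""
    return np.maximum(x, 0.0)

# x_j = rho(W_j x_{j-1}) over a few layers, with random placeholder filter weights.
x = rng.normal(size=(1, 128))  # x_0: one input channel of length 128
channels = [1, 8, 8, 16]
for j in range(1, len(channels)):
    W_j = rng.normal(size=(channels[j], channels[j - 1], 5)) / np.sqrt(5 * channels[j - 1])
    x = relu(conv_layer(x, W_j))
    print(f"layer {j}: x_{j} has shape {x.shape}")
```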

Figure 3: First row: original images. Second row: realizations of a Gaussian process with the same second-order covariance moments. Third row: reconstructions from first- and second-order scattering coefficients.

Linearization, Projection and Separability

Invariants, Symmetries and Diffeomorphisms

Contractions and Scale Separation with Wavelets

Deep Convolutional Neural Network Architectures

Scattering on the Translation Group

Multiscale Hierarchical Convolutional Networks

Conclusion

This paper provides a mathematical framework to analyze contraction and separation properties of deep convolutional networks. In this model, network filters are guiding non-linear contractions, to reduce the data variability in directions of local symmetries. The classification margin can be controlled by sparse separations along network fibers. Network fibers combine invariances along groups of symmetries and distributed pattern representations, which could be sufficiently stable to explain transfer learning of deep networks [21]. However, this is only a framework. We need complexity measures, approximation theorems in spaces of high-dimensional functions, and guaranteed convergence of filter optimization, to fully understand the mathematics of these convolution networks.

Besides learning, there are striking similarities between these multiscale mathematical tools and the treatment of symmetries in particle and statistical physics [15]. One can expect a rich cross fertilization between high-dimensional learning and physics, through the development of a common mathematical language.
