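The review article “Deep Learning”, jointly published in Nature by LeCun, Bengio and Hinton, has been praised by many deep learning practitioners as both comprehensive and accessible, and a thoroughly satisfying read. Another leading figure, Juergen Schmidhuber, disagrees. Today he published a commentary, “Critique of Paper by “Deep Learning Conspiracy” (Nature 521 p 436)”, listing nine points. He accuses the three authors of not giving enough respect to earlier work: the article does not mention Alexey Grigorevich Ivakhnenko, the father of deep learning, who introduced the first working deep learning algorithms; it does not cite early research such as the original ideas behind BP; and it does not cite some of Schmidhuber's own earlier work on RNNs.
LeCun, however, answered with a forceful general reply (he considers a point-by-point rebuttal pointless). Quite a few people did think of techniques such as the chain rule and convolutional networks (ConvNets) before the sources cited in the article, but that is not the same as actually inventing the BP algorithm: they did not sufficiently understand how the idea should be applied in machine learning, and they did not implement it successfully. A very small number of people realized early on that the chain rule could be used to train machines and that backward signals could train multi-stage systems, but they too failed to put this into practice. In fact, it was LeCun himself who, starting with his 1987 PhD thesis, established the connection between the BP algorithm and the adjoint method used in control theory.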
Juergen Schmidhuber: Critique of Paper by “Deep Learning Conspiracy” (Nature 521 p 436)
Machine learning is the science of credit assignment. The machine learning community itself profits from proper credit assignment to its members. The inventor of an important method should get credit for inventing it. She may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it (but not for inventing it). Relatively young research areas such as machine learning should adopt the honor code of mature fields such as mathematics: if you have a new theorem, but use a proof technique similar to somebody else’s, you must make this very clear. If you “re-invent” something that was already known, and only later become aware of this, you must at least make it clear later.
As a case in point, let me now comment on a recent article in Nature (2015) about “deep learning” in artificial neural networks (NNs), by LeCun & Bengio & Hinton (LBH for short), three CIFAR-funded collaborators who call themselves the “deep learning conspiracy” (e.g., LeCun, 2015). They heavily cite each other. Unfortunately, however, they fail to credit the pioneers of the field, which originated half a century ago. All references below are taken from the recent deep learning overview (Schmidhuber, 2015), except for a few papers listed beneath this critique focusing on nine items.
LBH’s survey does not even mention the father of deep learning, Alexey Grigorevich Ivakhnenko, who published the first general, working learning algorithms for deep networks (e.g., Ivakhnenko and Lapa, 1965). A paper from 1971 already described a deep learning net with 8 layers (Ivakhnenko, 1971), trained by a highly cited method still popular in the new millennium. Given a training set of input vectors with corresponding target output vectors, layers of additive and multiplicative neuron-like nodes are incrementally grown and trained by regression analysis, then pruned with the help of a separate validation set, where regularisation is used to weed out superfluous nodes. The numbers of layers and nodes per layer can be learned in problem-dependent fashion.
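The grow, train and prune procedure described above can be sketched roughly as follows. This is a simplified GMDH-style illustration, not Ivakhnenko's published algorithm: the function names `gmdh_fit` and `_features`, the restriction to two-input nodes, and the `keep` and `max_layers` parameters are all assumptions made for the example, and pruning here uses plain validation error rather than an explicit regularisation term.

```python
import itertools
import numpy as np

def _features(a, b):
    # Additive and multiplicative terms of one two-input node (hypothetical node form).
    return np.column_stack([np.ones_like(a), a, b, a * b])

def gmdh_fit(X_train, y_train, X_val, y_val, keep=4, max_layers=8):
    """Grow layers until the best validation error stops improving."""
    layers, best_err = [], np.inf
    for _ in range(max_layers):
        candidates = []
        for i, j in itertools.combinations(range(X_train.shape[1]), 2):
            # Train each candidate node by ordinary least-squares regression.
            coef, *_ = np.linalg.lstsq(
                _features(X_train[:, i], X_train[:, j]), y_train, rcond=None)
            val_out = _features(X_val[:, i], X_val[:, j]) @ coef
            candidates.append((float(np.mean((val_out - y_val) ** 2)), i, j, coef))
        if not candidates:
            break  # fewer than two inputs left, nothing to combine
        # Prune: keep only the nodes that do best on the separate validation set.
        candidates.sort(key=lambda c: c[0])
        survivors = candidates[:keep]
        if survivors[0][0] >= best_err:
            break  # a deeper layer no longer helps, so depth depends on the data
        best_err = survivors[0][0]
        layers.append([(i, j, coef) for _, i, j, coef in survivors])
        # Outputs of the surviving nodes become the inputs of the next layer.
        X_train = np.column_stack(
            [_features(X_train[:, i], X_train[:, j]) @ c for _, i, j, c in survivors])
        X_val = np.column_stack(
            [_features(X_val[:, i], X_val[:, j]) @ c for _, i, j, c in survivors])
    return layers, best_err
```

Calling `gmdh_fit` with separate training and validation arrays returns the list of surviving nodes per layer and the best validation error, so both the number of layers and the nodes per layer are determined by the data rather than fixed in advance.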
LBH discuss the importance and problems of gradient descent-based learning through backpropagation (BP), and cite their own papers on BP, plus a few others, but fail to mention BP’s inventors. BP’s continuous form was derived in the early 1960s (Bryson, 1961; Kelley, 1960; Bryson and Ho, 1969). Dreyfus (1962) published the elegant derivation of BP based on the chain rule only. BP’s modern efficient version for discrete sparse networks (including FORTRAN code) was published by Linnainmaa (1970). Dreyfus (1973) used BP to change weights of controllers in proportion to such gradients. By 1980, automatic differentiation could derive BP for any differentiable graph (Speelpenning, 1980). Werbos (1982) published the first application of BP to NNs, extending thoughts in his 1974 thesis (cited by LBH), which did not have Linnainmaa’s (1970) modern, efficient form of BP. BP for NNs on computers 10,000 times faster per Dollar than those of the 1960s can yield useful internal representations, as shown by Rumelhart et al. (1986), who also did not cite BP’s inventors.
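To make the chain-rule view of BP concrete, here is a minimal sketch for a tiny two-layer network. The network shape, the `tanh` hidden layer, the squared-error loss and the function names `forward` and `loss_and_grads` are assumptions chosen for illustration; none of them comes from the papers cited above.

```python
import numpy as np

def forward(params, x):
    W1, W2 = params
    h = np.tanh(W1 @ x)   # hidden activations
    y = W2 @ h            # linear output layer
    return h, y

def loss_and_grads(params, x, target):
    """Squared-error loss and its gradients, computed by a backward pass."""
    W1, W2 = params
    h, y = forward(params, x)
    loss = 0.5 * np.sum((y - target) ** 2)
    # Backward pass: apply the chain rule from the output towards the input,
    # reusing each intermediate derivative instead of recomputing it.
    dy = y - target               # dL/dy
    dW2 = np.outer(dy, h)         # dL/dW2
    dh = W2.T @ dy                # dL/dh
    dpre = dh * (1.0 - h ** 2)    # through tanh: d tanh(a)/da = 1 - tanh(a)^2
    dW1 = np.outer(dpre, x)       # dL/dW1
    return loss, (dW1, dW2)
```

A simple way to check such a sketch is to compare one entry of `dW1` with a central finite difference of the loss; the two should agree to several decimal places.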
LBH claim: “Interest in deep feedforward networks [FNNs] was revived around 2006 (refs 31-34) by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR).” Here they refer exclusively to their own labs, which is misleading. For example, by 2006, many researchers had used deep nets of the Ivakhnenko type for decades. LBH also ignore earlier, closely related work funded by other sources, such as the deep hierarchical convolutional neural abstraction pyramid (e.g., Behnke, 2003b), which was trained to reconstruct images corrupted by structured noise, enforcing increasingly abstract image representations in deeper and deeper layers. (BTW, the term “Deep Learning” (the very title of LBH’s paper) was introduced to Machine Learning by Dechter (1986), and to NNs by Aizenberg et al (2000), none of them cited by LBH.)
LBH point to their own work (since 2006) on unsupervised pre-training of deep FNNs prior to BP-based fine-tuning, but fail to clarify that this was very similar in spirit and justification to the much earlier successful work on unsupervised pre-training of deep recurrent NNs (RNNs) called neural history compressors (Schmidhuber, 1992b, 1993b). Such RNNs are even more general than FNNs. A first RNN uses unsupervised learning to predict its next input. Each higher level RNN tries to learn a compressed representation of the information in the RNN below, to minimise the description length (or negative log probability) of the data. The top RNN may then find it easy to classify the data by supervised learning. One can even “distill” a higher, slow RNN (the teacher) into a lower, fast RNN (the student), by forcing the latter to predict the hidden units of the former. Such systems could solve previously unsolvable very deep learning tasks, and started our long series of successful deep learning methods since the early 1990s (funded by Swiss SNF, German DFG, EU and others), long before 2006, although everybody had to wait for faster computers to make very deep learning commercially viable. LBH also ignore earlier FNNs that profit from unsupervised pre-training prior to BP-based fine-tuning (e.g., Maclin and Shavlik, 1995). They cite Bengio et al.’s post-2006 papers on unsupervised stacks of autoencoders, but omit the original work on this (Ballard, 1987).
LBH write that “unsupervised learning (refs 91-98) had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning.” Again they almost exclusively cite post-2005 papers co-authored by themselves. By 2005, however, this transition from unsupervised to supervised learning was an old hat, because back in the 1990s, our unsupervised RNN-based history compressors (see above) were largely phased out by our purely supervised Long Short-Term Memory (LSTM) RNNs, now widely used in industry and academia for processing sequences such as speech and video. Around 2010, history repeated itself, as unsupervised FNNs were largely replaced by purely supervised FNNs, after our plain GPU-based deep FNN (Ciresan et al., 2010) trained by BP with pattern distortions (Baird, 1990) set a new record on the famous MNIST handwritten digit dataset, suggesting that advances in exploiting modern computing hardware were more important than advances in algorithms. While LBH mention the significance of fast GPU-based NN implementations, they fail to cite the originators of this approach (Oh and Jung, 2004).
In the context of convolutional neural networks (ConvNets), LBH mention pooling, but not its pioneer (Weng, 1992), who replaced Fukushima’s (1979) spatial averaging by max-pooling, today widely used by many, including LBH, who write: “ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012,” citing Hinton’s 2012 paper (Krizhevsky et al., 2012). This is misleading. Earlier, committees of max-pooling ConvNets were accelerated on GPU (Ciresan et al., 2011a), and used to achieve the first superhuman visual pattern recognition in a controlled machine learning competition, namely, the highly visible IJCNN 2011 traffic sign recognition contest in Silicon Valley (relevant for self-driving cars). The system was twice better than humans, and three times better than the nearest non-human competitor (co-authored by LeCun of LBH). It also broke several other machine learning records, and surely was not “forsaken” by the machine-learning community. In fact, the later system (Krizhevsky et al. 2012) was very similar to the earlier 2011 system. Here one must also mention that the first official international contests won with the help of ConvNets actually date back to 2009 (three TRECVID competitions) - compare Ji et al. (2013). A GPU-based max-pooling ConvNet committee also was the first deep learner to win a contest on visual object discovery in large images, namely, the ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images (Ciresan et al., 2013). A similar system was the first deep learning FNN to win a pure image segmentation contest (Ciresan et al., 2012a), namely, the ISBI 2012 Segmentation of Neuronal Structures in EM Stacks Challenge.
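The difference between spatial averaging and max-pooling mentioned at the start of this item can be illustrated with a small sketch. The helper name `pool2d` and the toy feature map are hypothetical, and the sketch assumes non-overlapping square windows, which is only one of several pooling layouts in use.

```python
import numpy as np

def pool2d(feature_map, size, reduce):
    """Pool a 2-D feature map over non-overlapping size x size windows."""
    h, w = feature_map.shape
    # Crop so that the map divides evenly into windows.
    h, w = h - h % size, w - w % size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    # Reduce over the two within-window axes.
    return reduce(blocks, axis=(1, 3))

fm = np.array([[1., 2., 0., 0.],
               [3., 4., 0., 8.],
               [0., 0., 1., 1.],
               [0., 4., 1., 1.]])

avg = pool2d(fm, 2, np.mean)  # [[2.5, 2.0], [1.0, 1.0]]
mx = pool2d(fm, 2, np.max)    # [[4.0, 8.0], [4.0, 1.0]]
```

In the top-right window a single strong response (8) is diluted to 2 by averaging but preserved by max-pooling, which is the practical difference between the two schemes.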
LBH discuss their FNN-based speech recognition successes in 2009 and 2012, but fail to mention that deep LSTM RNNs had outperformed traditional speech recognizers on certain tasks already in 2007 (Fernández et al., 2007) (and traditional connected handwriting recognisers by 2009), and that today’s speech recognition conferences are dominated by (LSTM) RNNs, not by FNNs of 2009 etc. While LBH cite work co-authored by Hinton on LSTM RNNs with several LSTM layers, this approach was pioneered much earlier (e.g., Fernández et al., 2007).
LBH mention recent proposals such as “memory networks” and the somewhat misnamed “Neural Turing Machines” (which do not have an unlimited number of memory cells like real Turing machines), but ignore very similar proposals of the early 1990s, on neural stack machines, fast weight networks, self-referential RNNs that can address and rapidly modify their own weights during runtime, etc (e.g., AMAmemory 2015). They write that “Neural Turing machines can be taught algorithms,” as if this was something new, although LSTM RNNs were taught algorithms many years earlier, even entire learning algorithms (e.g., Hochreiter et al., 2001b).
In their outlook, LBH mention “RNNs that use reinforcement learning to decide where to look” but not that they were introduced a quarter-century ago (Schmidhuber & Huber, 1991). Compare the more recent Compressed NN Search for large attention-directing RNNs (Koutnik et al., 2013).
One more little quibble: While LBH suggest that “the earliest days of pattern recognition” date back to the 1950s, the cited methods are actually very similar to linear regressors of the early 1800s, by Gauss and Legendre. Gauss famously used such techniques to recognize predictive patterns in observations of the asteroid Ceres.
LBH may be backed by the best PR machines of the Western world (Google hired Hinton; Facebook hired LeCun). In the long run, however, historic scientific facts (as evident from the published record) will be stronger than any PR. There is a long tradition of insights into deep learning, and the community as a whole will benefit from appreciating the historical foundations.
The contents of this critique may be used (also verbatim) for educational and non-commercial purposes, including articles for Wikipedia and similar sites.
Yann LeCun’s reply:
I’m not going to go through your points one by one, which would be pointless, but I’ll make a few general remarks.
Yes lots and lots of people have used chain rule before [Rumelhart et al. 1986], lots of people figured you could multiply Jacobians in reverse order in a multi-step function (perhaps even going back to Gauss, Leibniz, Newton, and Lagrange). But did they all “invent backprop?” No! They did not realize how this could be used for machine learning and they sure didn’t implement it and make it work for that. Many people were looking for a multi-layer learning algorithm in the 60s (and a few in the 70s). If the backprop idea had been so obvious, many people would have found it and demonstrated it. The point is that no one managed to demonstrate a working instantiation until the mid 80s. Perhaps this idea is obvious in hindsight, but that’s a feature of many good ideas: they are obvious in hindsight. The same is true for ConvNets. It’s a pretty obvious idea in hindsight.
Yes, a few people actually figured out early on that you could use chain rule for training a machine (including Rumelhart by the way. It took him and Geoff Hinton several years to get it to work). Some people had the intuition that you could use backward signals to train a multi-stage system (e.g. system theorist A.M. Andrews in the early 70s). But did they reduce it to practice and did they manage to make it work? No. That didn’t really happen until the mid-1980s.
I actually was the one who originally made the connection between backprop and the adjoint method used in control theory (the Bryson, Kelley, Dreyfus methods). In my 1987 PhD thesis, I have a derivation of backprop using a Lagrangian formulation which was inspired by control theory methods. I eventually wrote a couple papers about this connection [“a theoretical framework for backpropagation” 1988].
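The Lagrangian view of backprop mentioned here can be sketched as follows. This is a reconstruction in generic notation, not the derivation as written in the thesis: the symbols $x_k$ (state of stage $k$), $w_k$ (its weights), $f_k$ (its function), $\lambda_k$ (its multiplier) and $C$ (the cost) are assumptions introduced for illustration.

```latex
% Forward pass written as constraints: x_k = f_k(x_{k-1}, w_k), k = 1, ..., N, with cost C(x_N)
\mathcal{L} = C(x_N) + \sum_{k=1}^{N} \lambda_k^{\top} \bigl( f_k(x_{k-1}, w_k) - x_k \bigr)

% Stationarity with respect to the states gives the backward (adjoint) recursion
\frac{\partial \mathcal{L}}{\partial x_N} = 0 \;\Rightarrow\; \lambda_N = \nabla C(x_N)

\frac{\partial \mathcal{L}}{\partial x_k} = 0 \;\Rightarrow\;
\lambda_k = \Bigl( \frac{\partial f_{k+1}}{\partial x_k} \Bigr)^{\top} \lambda_{k+1},
\qquad k = N-1, \dots, 1

% Gradient of the cost with respect to the weights of stage k
\frac{\partial \mathcal{L}}{\partial w_k} = \Bigl( \frac{\partial f_k}{\partial w_k} \Bigr)^{\top} \lambda_k
```

Under these assumptions the multipliers $\lambda_k$ play the role of the backpropagated error signals, and the recursion for $\lambda_k$ is the same backward sweep that the adjoint method uses in optimal control.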
How should we attribute credit? Let’s use an analogy. Lots of people tried to build airplanes and helicopters in the late 19th century and early 20th century. Many people had the right sort of ideas. An airplane even took off under its own power in 1890 (Clément Ader’s Eole), and quite a few airplanes flew more or less well before 1903. But the Wright Brothers get most of the credit because they were the first ones to build a fully controllable airplane. Same for the helicopter. Lots of people tried to build helicopters in the early 20th century, and several took off. But the idea didn’t become practical until Sikorsky’s refinement of the cyclic control and tail rotor in the late 30s and early 40s. Who should get credit? Leonardo da Vinci?
Also, ConvNets were indeed “largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012.” I’m rather well positioned to speak about this, having had countless ConvNet papers rejected, published and ignored, and occasionally paid attention to, for over 15 years before your group even started playing with the idea. We did have quite a few successful vision applications of ConvNets before 2012, but they really did not gather much interest from the vision and learning communities, and certainly didn’t trigger the revolution that came after the 2012 ImageNet results.
Krizhevsky, Sutskever and Hinton get a lot of credit for their work, and it’s well deserved. They used many of my ideas (and added a few), but you don’t see me complain about it. That’s how science and technology make progress.