Logistic regression can determine whether to recommend cesarean delivery (application area).
Naive Bayes can separate legitimate e-mail from spam e-mail (application area).
It is not surprising that the choice of representation has an enormous effect on the performance of machine learning algorithms. What is true for input x is often true for input x + epsilon for a small epsilon. This is called the smoothness prior and is exploited in most applications of machine learning that involve real numbers. Many artificial intelligence tasks can be solved by designing the right set of features to extract for that task, then providing these features to a simple machine learning algorithm. For example, a useful feature for speaker identification from sound is the pitch. When it is hard to know which features should be extracted, one solution is to use machine learning to discover not only the mapping from representation to output but also the representation itself. This approach is known as representation learning. When designing features or algorithms for learning features, our goal is usually to separate the factors of variation that explain the observed data. Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts. Representation learning algorithms can be supervised, unsupervised, or a combination of both (semi-supervised). Deep learning has not only changed the field of machine learning and influenced our understanding of human perception, it has also revolutionized application areas such as speech recognition and image understanding. Pylearn2 is a machine learning and deep learning library. A live online resource, http://www.deeplearning.net/book/guidelines, allows practitioners and researchers to share their questions and experience and keep abreast of developments in the art of deep learning.
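The smoothness prior mentioned above can be illustrated with a nearest-neighbor predictor, which assumes that nearby inputs share similar outputs. The sketch below is only an illustration: the function `knn_predict`, the choice of sin(x) as the underlying function, and the values of `x`, `eps`, and `k` are all assumptions made for this example and do not come from the text.

```python
import math

def knn_predict(train, x, k=3):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

# Training pairs (x, y) sampled from a smooth function; sin is an arbitrary choice.
train = [(i * 0.1, math.sin(i * 0.1)) for i in range(64)]

x, eps = 2.03, 0.01
at_x = knn_predict(train, x)
at_x_plus_eps = knn_predict(train, x + eps)

# Under the smoothness prior, a small shift of the input should barely change the output.
print("prediction at x:      ", at_x)
print("prediction at x + eps:", at_x_plus_eps)
print("true value at x:      ", math.sin(x))
```

The predictor never sees the true function, yet it generalizes to unseen inputs because the target varies slowly between neighboring training points. The same assumption would fail for a target that changes abruptly at a finer scale than the spacing of the training data.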
1.2 Machine Learning
Human brains also observe their own actions, which influence the world around them, and it appears that human brains try to learn the statistical dependencies between these actions and their consequences, so as to maximize future rewards. Bayesian machine learning attempts to formalize these priors as probability distributions, and once this is done, Bayes' theorem and the laws of probability (discussed in Chapter 3) dictate what the right predictions should be. Overfitting occurs when capacity is too large compared to the number of examples, so that the learner does a good job on the training examples (it correctly guesses that they are likely configurations) but a very poor one on new examples (it does not discriminate well between the likely configurations and the unlikely ones). Underfitting occurs when instead the learner does not have enough capacity, so that even on the training examples it is not able to make good guesses: it does not manage to capture enough of the information present in the training examples, maybe because it does not have enough degrees of freedom to fit all the training examples. The main reason we get underfitting (especially with deep learning) is not that we choose to have insufficient capacity but that obtaining high capacity in a learner that has strong priors often involves difficult numerical optimization. Numerical optimization methods attempt to find a configuration of some variables (often called parameters, in machine learning) that minimizes or maximizes some given function of these parameters, which we call the objective function or training criterion. In the case of most deep learning algorithms, this difficulty in optimizing the training criterion is related to the fact that it is not convex in the parameters of the model. We believe that the issue of underfitting is central in deep learning algorithms and deserves a lot more attention from researchers.
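The relationship between capacity, underfitting, and overfitting described above can be sketched with polynomial regression, where the polynomial degree plays the role of capacity. This is a minimal illustration under stated assumptions, not a method from the text: the target function sin(2*pi*x), the ten training points, the deterministic +/-0.2 perturbation standing in for noise, and the helper `fit_and_score` are all hypothetical choices. It relies on NumPy's `np.polyfit` and `np.polyval`.

```python
import numpy as np

def target(x):
    """Underlying function the learner is trying to recover (an arbitrary choice)."""
    return np.sin(2 * np.pi * x)

def fit_and_score(degree, x_train, y_train, x_test, y_test):
    """Least-squares polynomial fit; returns (training MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse

# Ten training examples; an alternating +/-0.2 offset stands in for noise
# so that the example is fully deterministic.
x_train = np.linspace(0.0, 1.0, 10)
y_train = target(x_train) + 0.2 * (-1.0) ** np.arange(10)

# New examples drawn from the same underlying function, without perturbation.
x_test = np.linspace(0.0, 1.0, 200)
y_test = target(x_test)

# Degree 1: too little capacity. Degree 9: enough to fit all ten points exactly.
results = {d: fit_and_score(d, x_train, y_train, x_test, y_test) for d in (1, 3, 9)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The degree-1 model has a high error even on the training examples (underfitting), while the degree-9 model reproduces the training examples almost exactly yet does worse than the degree-3 model on new examples (overfitting). Note that this toy problem is convex, so it does not exhibit the optimization difficulty that the text identifies as the main source of underfitting in deep learning.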
Another machine learning concept that turns out to be important for understanding many deep learning algorithms is that of manifold learning. The manifold learning hypothesis (Cayton, 2005; Narayanan and Mitter, 2010) states that probability is concentrated around regions called manifolds, i.e., that most configurations are unlikely and that probable configurations are neighbors of other probable configurations. We define the dimension of a manifold as the number of orthogonal directions in which one can move and stay among probable configurations. This hypothesis of probability concentration seems to hold for most AI tasks of interest, as can be verified by the fact that most configurations of input variables are unlikely (pick pixel values randomly and you will almost never obtain a natural-looking image).
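The definition of manifold dimension given above can be made concrete in the simplest possible setting, a flat (linear) manifold. The sketch below generates points that vary freely along 2 directions of a 10-dimensional space and counts the orthogonal directions with substantial spread using a singular value decomposition. All sizes, the noise level, and the 10% threshold are assumptions chosen for this illustration; manifolds in real data are generally curved, so this linear estimate would not apply to them directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary sizes for the illustration.
n_points, ambient_dim, manifold_dim = 500, 10, 2

# Orthonormal basis (columns) of a random 2-D subspace of the 10-D space.
basis, _ = np.linalg.qr(rng.normal(size=(ambient_dim, manifold_dim)))

# Probable configurations: free movement within the subspace,
# plus a small perturbation away from it.
coords = rng.normal(size=(n_points, manifold_dim))
data = coords @ basis.T + 0.01 * rng.normal(size=(n_points, ambient_dim))

# Singular values of the centered data measure the spread along each
# orthogonal direction, sorted from largest to smallest.
singular_values = np.linalg.svd(data - data.mean(axis=0), compute_uv=False)

# Count the directions whose spread is at least 10% of the largest one.
estimated_dim = int(np.sum(singular_values > 0.1 * singular_values[0]))
print("singular values:", singular_values.round(2))
print("estimated manifold dimension:", estimated_dim)
```

Only two singular values are large, which matches the number of directions in which one can move and remain among probable configurations. A point drawn uniformly from the full 10-dimensional space would almost surely fall far from this subspace, which mirrors the random-pixels observation in the text.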
1.3 Historical Perspective and Neural Networks
Modern deep learning research takes a lot of its inspiration from neural network research of previous decades. Other major intellectual sources of concepts found in deep learning research include works on probabilistic modeling and graphical models, as well as works on manifold learning. The breakthrough came from a semi-supervised procedure: using unsupervised learning to learn one layer of features at a time and then fine-tuning the whole system with labeled data (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007), described in Chapter 10. This initiated a lot of new research, and other ways of successfully training deep nets emerged. Even though unsupervised pre-training is sometimes unnecessary for datasets with a very large number of labels, it was the early success of unsupervised pre-training that led many new researchers to investigate deep neural networks. In particular, the use of rectifiers (Nair and Hinton, 2010b) as the non-linearity and of appropriate initialization, allowing information to flow well both forward (to produce predictions from input) and backward (to propagate error signals), were later shown to enable training very deep supervised networks (Glorot et al., 2011a) without unsupervised pre-training.
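The effect of initialization on forward information flow through rectifier layers can be sketched numerically. The text does not say which initialization scheme was used, so the scaling below, a weight standard deviation of sqrt(2 / width), is an assumption picked for this illustration, as are the depth, the width, and the helper `forward_rms`. The sketch covers only the forward pass, not the backward propagation of error signals.

```python
import numpy as np

def relu(z):
    """Rectifier non-linearity: max(0, z) applied element-wise."""
    return np.maximum(z, 0.0)

def forward_rms(weight_std, depth=30, width=256, seed=0):
    """Push a random input through `depth` rectifier layers with Gaussian
    weights of the given standard deviation; return the root-mean-square
    activation of the final layer."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width)
    for _ in range(depth):
        W = rng.normal(scale=weight_std, size=(width, width))
        h = relu(W @ h)
    return float(np.sqrt(np.mean(h ** 2)))

width = 256
naive = forward_rms(weight_std=0.01, width=width)                    # fixed small scale
scaled = forward_rms(weight_std=np.sqrt(2.0 / width), width=width)   # scale tied to layer width

print(f"weight std 0.01:          final RMS activation {naive:.3e}")
print(f"weight std sqrt(2/width): final RMS activation {scaled:.3e}")
```

With the fixed small scale, the signal shrinks at every layer and is essentially zero after 30 layers, so the output carries almost no information about the input. With the width-dependent scale, the activation magnitude stays of the same order as the input, which is the kind of forward flow the paragraph above refers to.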
1.4 Recent Impact of Deep Learning Research
Since 2010, deep learning has had spectacular practical successes. It has led to much better acoustic models that have dramatically improved the state of the art in speech recognition. Deep neural nets are now used in deployed speech recognition systems, including voice search on Android (Dahl et al., 2010; Deng et al., 2010; Seide et al., 2011; Hinton et al., 2012). Deep convolutional nets have led to major advances in the state of the art for recognizing large numbers of different types of objects in images (now deployed in Google+ photo search). They have also had spectacular successes in pedestrian detection and image segmentation (Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013) and yielded superhuman performance in traffic sign classification (Ciresan et al., 2012). An organization called Kaggle runs machine learning competitions on the web. Deep learning has had numerous successes in these competitions.
This has led Yann LeCun and Yoshua Bengio to create a new conference on the subject. They called it the International Conference on Learning Representations (ICLR), to broaden the scope from just deep learning to the more general subject of representation learning (which includes topics such as sparse coding, which learns shallow representations, because shallow representation-learners can be used as building blocks for deep representation-learners). In the examples of outstanding applications of deep learning described above, the impressive breakthroughs have mostly been achieved with supervised learning techniques for deep architectures. We believe that some of the most important future progress in deep learning will hinge on achieving a similar impact in the unsupervised and semi-supervised cases. Even though the scaling behavior of stochastic gradient descent is theoretically very good in terms of computations per update, these observations suggest a numerical optimization challenge that must be addressed. In addition to these numerical optimization difficulties, scaling up large and deep neural networks as they currently stand would require a substantial increase in computing power, which remains a limiting factor of our research. Training much larger models with the current hardware (or the hardware likely to be available in the next few years) will require a change in design and/or the ability to effectively exploit parallel computation. These raise non-obvious questions where fundamental research is also needed. Furthermore, some of the biggest challenges remain in front of us regarding unsupervised deep learning. Powerful unsupervised learning is important for many reasons: unsupervised learning allows a learner to take advantage of unlabeled data. Most of the data available to machines (and to humans and animals) is unlabeled, i.e., without a precise and symbolic characterization of its semantics and of the outputs desired from a learner.
Humans and animals are also motivated, and this guides research into learning algorithms based on a reinforcement signal, which is much weaker than the signal required for supervised learning.
To summarize, some of the challenges we view as important for future breakthroughs in deep learning are the following:
1. How should we deal with the fundamental challenges behind unsupervised learning, such as intractable inference and sampling? See Chapters 15, 16, and 17.
2. How can we build and train much larger and more adaptive and reconfigurable deep architectures, thus maximizing the advantage one can draw from larger datasets? See Chapter 8.
3. How can we improve the ability of deep learning algorithms to disentangle the underlying factors of variation, or put more simply, make sense of the world around us? See Chapter 14 on this very basic question about what is involved in learning a good representation.