Lecture Notes in Deep Learning: Feedforward Networks (Part 1)

FAU Lecture Notes in Deep Learning

These are the lecture notes for FAU’s YouTube Lecture “Deep Learning”. This is a full transcript of the lecture video and the matching slides. We hope you enjoy these as much as the videos. Of course, this transcript was created with deep learning techniques largely automatically and only minor manual modifications were performed. If you spot mistakes, please let us know!

Navigation

Previous Lecture / Watch this Video / Top Level / Next Lecture

Image under CC BY 4.0 from the Deep Learning Lecture.

Welcome everybody to our lecture on deep learning! Today, we want to go into the topic. We want to introduce some of the important concepts and theories that have been fundamental to the field. Today’s topic will be feed-forward networks, and feed-forward networks are essentially the main configuration of neural networks as we use them today. So in the next couple of videos, we want to talk about the first models and some ideas behind them. We also introduce a bit of theory. One important block will be about universal function approximation, where we will essentially show that neural networks are able to approximate any kind of function. This will then be followed by the introduction of the softmax function and some activations. In the end, we want to talk a bit about how to optimize such parameters and, in particular, we will talk about the backpropagation algorithm.

Image under CC BY 4.0 from the Deep Learning Lecture.

So let’s start with the model, and what you have already heard about is the perceptron. We already talked about this: it is essentially a function that maps any high-dimensional input onto an inner product with a weight vector. Then, we are only interested in the signed distance that is computed. You can interpret this essentially as you see above on the right-hand side. The decision boundary is shown in red, and what you are computing with the inner product is essentially the signed distance of a new sample to this decision boundary. If we consider only the sign, we can decide whether we are on one side or the other.
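
As a small sketch of this idea (not code from the lecture), the decision rule can be written as follows; the weight vector, bias, and sample below are made up for illustration:

```python
import numpy as np

def perceptron_decision(x, w, b):
    """Classify x by the sign of the signed distance to the hyperplane w^T x + b = 0."""
    score = np.dot(w, x) + b                 # inner product plus bias
    distance = score / np.linalg.norm(w)     # signed distance to the decision boundary
    return (1 if score >= 0 else -1), distance

# illustrative example with a hand-picked boundary
w = np.array([1.0, -1.0])
b = 0.5
x = np.array([2.0, 1.0])
label, dist = perceptron_decision(x, w, b)
print(label, dist)                           # 1 and a positive distance: x lies on the +1 side
```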

Image under CC BY 4.0 from the Deep Learning Lecture.

Now, if you look at classical pattern recognition and machine learning, we would still follow a so-called pattern recognition pipeline. We have some measurement that is converted and pre-processed in order to increase its quality, e.g. to decrease noise. In the pre-processing, we essentially stay in the same domain as the input. So if you have an image as input, the output of the pre-processing will also be an image, but probably with better properties for the classification task. Then, we want to do feature extraction. You remember the example with the apples and pears. From these, we extract features, which then result in some high-dimensional vector space. We can then go ahead and do the classification.
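
A minimal sketch of such a pipeline is shown below; the concrete pre-processing, features, and classifier weights are placeholder assumptions, not the ones from the lecture:

```python
import numpy as np

def preprocess(image):
    # pre-processing stays in the input domain: image in, normalized image out
    return (image - image.min()) / (image.max() - image.min() + 1e-8)

def extract_features(image):
    # feature extraction maps the image into a (here 2-D) feature vector
    return np.array([image.mean(), image.std()])

def classify(features, w=np.array([1.0, -1.0]), b=0.0):
    # classification operates on the feature vector, e.g. with a perceptron
    return 1 if np.dot(w, features) + b >= 0 else -1

measurement = np.random.rand(8, 8)   # toy "measurement"
label = classify(extract_features(preprocess(measurement)))
print(label)
```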

Image under CC BY 4.0 from the Deep Learning Lecture.

Now, what we’ve seen with the perceptron is that we are able to model linear decision boundaries. This immediately led to the observation that perceptrons cannot solve the logical exclusive or, the so-called XOR. You can see the visualization of the XOR problem above on the left-hand side. So, imagine you have some kind of distribution of classes where the top left and the bottom right are blue, and the other class is in the bottom left and top right. This is inspired by the logical XOR function. You will not be able to separate those two point clouds with a single linear decision boundary. So, you either need curves or you use multiple lines. With a single perceptron, you will not be able to solve this problem. This was a problem, because people had been arguing: “Look, we can model logical functions with perceptrons. If we build perceptrons on perceptrons, we can essentially build all of logic!”
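
A small numerical sketch of this point: the brute-force weight grid and the hand-wired hidden weights below are illustrative choices, not part of the lecture. No single linear unit reproduces XOR, while two layers do:

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])                           # XOR labels

def single_unit(x, w, b):
    return int(np.dot(w, x) + b >= 0)

# try a whole grid of weights: no single linear unit classifies all four points correctly
grid = np.linspace(-2, 2, 21)
solvable = any(
    all(single_unit(x, np.array([w1, w2]), b) == t for x, t in zip(X, y))
    for w1, w2, b in itertools.product(grid, grid, grid)
)
print("single perceptron solves XOR on this grid:", solvable)   # False

# with one hidden layer it works: the hidden units compute OR and NAND, the output ANDs them
def two_layer_xor(x):
    h1 = int(x[0] + x[1] - 0.5 >= 0)                 # OR
    h2 = int(-x[0] - x[1] + 1.5 >= 0)                # NAND
    return int(h1 + h2 - 1.5 >= 0)                   # AND

print([two_layer_xor(x) for x in X])                 # [0, 1, 1, 0]
```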

Now, if you can’t build XOR, then you’re probably not able to describe the entire logic and, therefore, we will never achieve strong AI. This was a period of time when funding for artificial intelligence research was cut tremendously and people would not get any new grants. They would not get money to support their research. Hence, this period became known as the “AI Winter”.

Image under CC BY 4.0 from the Deep Learning Lecture.

Things changed with the introduction of the multi-layer perceptron. This is an expansion of the perceptron. You do not just use a single neuron, but multiple of those neurons, and you arrange them in layers. So here you can see a very simple sketch. It is very similar to the perceptron. You essentially have some inputs and some weights. Now, you can see that it’s not just a single sum, but we have several of those sums that go through a non-linearity. Then, weights are assigned again and summed up again to go into another non-linearity.
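
As a minimal sketch of such a network, here is a forward pass with one hidden layer; the sigmoid non-linearity, the layer sizes, and the random weights are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, w2, b2):
    """Weighted sums -> non-linearity -> weighted sum -> non-linearity."""
    h = sigmoid(W1 @ x + b1)       # several sums in parallel, one per hidden neuron
    return sigmoid(w2 @ h + b2)    # output neuron

rng = np.random.default_rng(0)
x  = rng.normal(size=3)            # 3-D input
W1 = rng.normal(size=(4, 3))       # 4 hidden neurons
b1 = rng.normal(size=4)
w2 = rng.normal(size=4)
print(mlp_forward(x, W1, b1, w2, 0.0))
```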

Image under CC BY 4.0 from the Deep Learning Lecture.

This is very interesting because we can use multiple neurons. We can now also model non-linear decision boundaries. You can go on and arrange this in layers. So what you typically do is: you have some input layer. This is our vector x. Then, you have several perceptrons that you arrange in hidden layers. They are called hidden because you do not immediately observe what they compute. They assign weights, then compute something, and only at the very end, at the output, you have a layer again where you can observe what’s actually happening. All of those weights in between, in the hidden layers, are not directly observable. You only observe their effect when you put some input in, compute the activations, and then, at the very end, obtain the output. So, the output is where you can actually observe what’s happening in your system.

Approximation of f(x) on a compact set. Image under CC BY 4.0 from the Deep Learning Lecture.

Now, we will look into the so-called universal function approximator. This is actually just a network with a single hidden layer. Universal function approximation is a fundamental piece of theory because it tells us that with a single hidden layer, we can approximate any continuous function. So, let’s look a bit into this theorem. It starts with a formal definition. We have some 𝜑(x), and 𝜑(x) is a non-constant, bounded, monotonically increasing function. Then, for any 𝜀 greater than zero and any continuous function f(x) defined on a compact subset of some high-dimensional space ℝᵐ, there exist an integer N, real constants 𝜈ᵢ and bᵢ, and real vectors wᵢ with which you can build an approximation. Here, you now see how the approximation is computed. You have an inner product of the weights with the input plus some bias. This goes into the activation function 𝜑, which is non-constant, bounded, and monotonically increasing. Then you have another linear combination using those 𝜈ᵢ, which produces the output capital F(x). So F(x) is our approximation, and the approximation is a linear combination of non-linearities that are computed from linear combinations. If you define it this way, you can show that the absolute difference between F(x) and the true function f(x) is bounded by the constant 𝜀, which is greater than zero.
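
Written out, the approximation described above reads as follows (a reconstruction from the prose, using the same symbols):

```latex
F(\mathbf{x}) \;=\; \sum_{i=1}^{N} \nu_i \,\varphi\!\left(\mathbf{w}_i^{\top}\mathbf{x} + b_i\right),
\qquad
\lvert F(\mathbf{x}) - f(\mathbf{x}) \rvert \;<\; \varepsilon
\quad\text{for all } \mathbf{x} \text{ in the compact set.}
```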

That’s already a very useful approximation. There is an upper bound 𝜀, but right now the theorem doesn’t tell us how large 𝜀 actually is. So, 𝜀 may be really large. The universal approximation theorem also tells us that if we increase N, then 𝜀 goes down. If N approaches infinity, 𝜀 will approach zero. So, the more neurons we take in this hidden layer, the better our approximation will get. This means we can approximate any function with just one hidden layer. So you could argue: if you can approximate everything with a single layer, why the hell are people doing deep learning?
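
A rough numerical illustration of this behaviour is sketched below: we fit a 1-D function with randomly placed sigmoid units and a least-squares read-out. The target function, the random placement, and the seed are arbitrary assumptions, so the exact numbers will differ, but the worst-case error typically shrinks as N grows:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 400)
f = np.sin(2 * x)                                  # some continuous target function

for N in (2, 8, 32, 128):                          # number of hidden neurons
    w = rng.normal(scale=3.0, size=N)              # fixed random hidden weights ...
    b = rng.normal(scale=3.0, size=N)              # ... and biases
    Phi = sigmoid(np.outer(x, w) + b)              # hidden activations, shape (400, N)
    nu, *_ = np.linalg.lstsq(Phi, f, rcond=None)   # least-squares output weights
    eps = np.max(np.abs(Phi @ nu - f))             # worst-case approximation error
    print(f"N = {N:4d}   max |F(x) - f(x)| = {eps:.3f}")
```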

Image under CC BY 4.0 from the Deep Learning Lecture.

Deep learning doesn’t make any sense if a single layer is enough, and I’ve just shown you that a single hidden layer suffices. So is there maybe no need for deep learning? Let’s look into an example: I took a classification tree here, and a classification tree is a method of subdividing space. I’m taking a 2-D example where we have some input space with x₁ and x₂. This is useful because we can visualize it very efficiently here on the slides. Our decision tree does the following: it first decides whether x₁ is greater than 0.5. Note that I’m showing you the decision boundary on the right. In the next node, if you go to the left-hand side, you look at x₂ and decide whether it is greater or smaller than 0.25. On the other side, you simply look at x₁ again and decide whether it is greater or smaller than 0.75. Now, if you do that, you can assign classes in the leaf nodes. In these leaves, you can now, for example, assign the value 0 or 1, and this gives a subdivision of the space that has the shape of a mirrored L.
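
The tree just described can be written directly as a small function. The thresholds are taken from the text; which leaf gets label 0 and which gets 1 is an assumption chosen so that the class-1 region forms one connected, mirrored-L-shaped area:

```python
import numpy as np

def tree_label(x1, x2):
    """Decision tree with inner nodes at x1 > 0.5, x2 > 0.25, and x1 > 0.75."""
    if x1 > 0.5:
        return 1 if x1 <= 0.75 else 0   # assumed leaf labels: middle strip 1, right strip 0
    else:
        return 1 if x2 > 0.25 else 0    # assumed leaf labels: top-left 1, bottom-left 0

# evaluate on a coarse grid over [0,1]^2 to see the induced subdivision (rows: x2 from 1 down to 0)
xs = np.linspace(0, 1, 9)
print(np.array([[tree_label(a, b) for a in xs] for b in xs[::-1]]))
```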

Image under CC BY 4.0 from the Deep Learning Lecture.

So, this is a function, and this function can now be approximated by a universal function approximator. Let’s try to do that. We can actually transform this into a network. Let’s use the following idea: Our network has two input neurons because it’s a two-dimensional space. With our decision boundaries, we can also form these decisions, e.g. x₁ being greater or smaller than 0.5. So, we can immediately adopt this. We can actually also adopt all of the other inner nodes. Because we are using a sigmoid in this example, we also use the inverses of the inner nodes and put them in as additional neurons. So of course, I don’t have to learn anything here, because I can take the connections towards the first hidden layer from the tree definition. They’re already predefined, so there’s no learning required here. On the output side, I have to learn some weights, and this can be done using, for example, a least-squares approximation; then I can directly compute those weights.
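
Here is a sketch of this construction under a few assumptions that the text does not fix: the sigmoid steepness, the evaluation grid, and the leaf labels from the tree sketch above. The first-layer weights encode the three decisions and their inverses; only the output weights are fitted by least squares:

```python
import numpy as np
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

k = 20.0                                           # assumed sigmoid steepness
# rows: (w1, w2, b) for the three decisions and their inverses (6 fixed hidden neurons)
W = np.array([[ 1, 0, -0.50], [-1, 0, 0.50],       # x1 > 0.5 and its inverse
              [ 0, 1, -0.25], [ 0,-1, 0.25],       # x2 > 0.25 and its inverse
              [ 1, 0, -0.75], [-1, 0, 0.75]])      # x1 > 0.75 and its inverse

pts = np.array(list(product(np.linspace(0, 1, 50), repeat=2)))
x1, x2 = pts[:, 0], pts[:, 1]
target = np.where(x1 > 0.5, x1 <= 0.75, x2 > 0.25).astype(float)   # tree output (assumed labels)

H = sigmoid(k * (pts @ W[:, :2].T + W[:, 2]))      # fixed hidden layer
v, *_ = np.linalg.lstsq(H, target, rcond=None)     # learn only the output weights
print("max |F(x) - f(x)| on the grid:", np.max(np.abs(H @ v - target)))
```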

Image under CC BY 4.0 from the Deep Learning Lecture.

If I go ahead and really do that, we can also find a nice visualization. You can see that, with our decision boundaries, we are essentially constructing a basis in the hidden layer. If I use 0 and 1 as black and white, you can see that I am constructing a basis vector for every hidden node. They are then essentially weighted linearly to form the output. So you could do this here by multiplying every pixel with its weight and then simply summing this up. This is what the hidden layer here would do. Then, I’m essentially interested in combining those basis vectors such that they produce the desired y. Now, if I do that in a least-squares sense, I get the approximation on the right. So it’s not half bad. I magnified this a bit. This is what we wanted to get: the mirrored L, and this is what came out of the approximation that I just proposed. Now, you can see that it kind of has the L shape in there, but the values here lie in a domain between [0, 1], and the 𝜀 of my six-neuron approximation here is probably in the range of 0.7. So it kind of does the trick, but the approximation is not very good. In this particular configuration, you would have to increase the number of neurons really a lot in order to get the error down, because it’s a really hard problem. It can almost not be approximated.

Image under CC BY 4.0 from the Deep Learning Lecture.

So, what else could we do? Well, if we want this, we could, for example, add a second non-linearity. Then, we would get exactly the solution that we desire. So you see, maybe one layer is not very efficient in terms of representation. There is an algorithm that can map any decision tree onto a neural network. The algorithm goes as follows: You take all of your inner nodes, here the decisions at 0.5, 0.25, and 0.75. So, these are the inner nodes, and then you connect them appropriately. You connect them in a way such that you are able to form exactly the sub-regions. Here you see that this is our L shape, and in order to construct the top-left region, we need access to the first decision. It separates the space into the left half-space and the right half-space. Next, we need access to the second decision. This way, we can use these two decisions to form this small patch on the top left. For each of the four patches that emerge from the decision boundaries, we get essentially one node. This simply means that for every leaf node, we get one node in the second layer. So: one node for every inner node in the first layer and one node for every leaf node in the second layer. Then, you combine them in the output. You don’t even have to compute anything here, because we already know how these have to be merged in order to get the right decision boundaries. This way, we manage to convert the decision tree into a neural network, and it computes exactly the approximation that we want.
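
A sketch of this construction is shown below with hard thresholds so that the regions are reproduced exactly (up to points on the boundary lines); the lecture’s figures use sigmoids, but a steep sigmoid behaves essentially the same, and the leaf labels are again the assumed ones from above:

```python
import numpy as np

step = lambda z: (z >= 0).astype(float)            # hard threshold unit

def tree_as_network(X):                            # X has shape (n, 2)
    x1, x2 = X[:, 0], X[:, 1]
    # layer 1: one neuron per inner node of the tree
    d1, d2, d3 = step(x1 - 0.50), step(x2 - 0.25), step(x1 - 0.75)
    # layer 2: one neuron per leaf region, an AND of the relevant first-layer decisions
    regions = np.stack([
        step((1 - d1) + d2 - 1.5),                 # top left:     x1 <= 0.5 and x2 > 0.25
        step((1 - d1) + (1 - d2) - 1.5),           # bottom left:  x1 <= 0.5 and x2 <= 0.25
        step(d1 + (1 - d3) - 1.5),                 # middle strip: 0.5 < x1 <= 0.75
        step(d1 + d3 - 1.5),                       # right strip:  x1 > 0.75
    ], axis=1)
    leaf_labels = np.array([1.0, 0.0, 1.0, 0.0])   # assumed class per leaf, fixed (no learning)
    return step(regions @ leaf_labels - 0.5)       # output: OR over the class-1 regions

pts = np.array([[0.2, 0.8], [0.2, 0.1], [0.6, 0.1], [0.9, 0.9]])
print(tree_as_network(pts))                        # -> [1. 0. 1. 0.]
```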

What do we learn from this example? Well, we can approximate any function with a universal function approximator with just one hidden layer. But if we go deeper, we may find a decomposition of the problem that is just way more efficient. Here, the decomposition was first the inner nodes, then the leaf nodes. This enabled us to derive a network that has only seven nodes and exactly represents the problem. So you could argue that by building deeper networks, you add additional steps. In each step, you try to simplify the function and increase the power of the representation, such that you get better processing towards the decision in the end.

Image under CC BY 4.0 from the Deep Learning Lecture.

Now, let’s go back to our universal function approximation theorem. We’ve seen that such an approximation exists. It tells us that we can approximate everything with just a single hidden layer. So, that’s already a pretty cool observation, but it doesn’t tell us how to choose N. It doesn’t tell us how to train. So there are a lot of practical problems that the universal approximation theorem does not solve. This is essentially the reason why we go towards deep learning: we can build systems that start disentangling the representation over various steps. If we do so, we can build more efficient and more powerful systems and train them end-to-end. So this is the main reason why we go towards deep learning. I expect anybody who’s working in deep learning to know about universal approximation and why deep learning actually makes sense. Ok, that’s it for today. Next time, we will talk about activation functions, and we will start introducing the backpropagation algorithm in the next set of videos. So stay tuned! I hope you enjoyed this video. Looking forward to seeing you in the next one!

If you liked this post, you can find more essays here, more educational material on Machine Learning here, or have a look at our Deep Learning Lecture. I would also appreciate a clap or a follow on YouTube, Twitter, Facebook, or LinkedIn in case you want to be informed about more essays, videos, and research in the future. This article is released under the Creative Commons 4.0 Attribution License and can be reprinted and modified if referenced.

Translated from: https://towardsdatascience.com/lecture-notes-in-deep-learning-feedforward-networks-part-1-e74db01e54a8
