No-Stress Gaussian Processes

When people are first introduced to Gaussian Processes, they hear something like “Gaussian Processes allow you to work with an infinite space of functions in regression tasks”. This is quite a hard thing to process. In fact, Gaussian Processes are very simple in a nutshell, and it all starts with the (multivariate) normal (Gaussian) distribution, which has certain nice properties that we can use for GPs. As a rule of thumb for this article, I would suggest that when in doubt, you take a look at the GP equations; the text should then become clear(er).

What are Gaussian Processes good for? Well, from a theory perspective they are universal function approximators, so the answer would be: anything where you need to fit some kind of function. But there are some computational considerations, which we will get to later, that influence the decision of whether to use them or not, as does the availability of data.

Where do Gaussian Processes fit in the big picture of Machine Learning? First of all, GPs are non-parametric models, meaning that we don’t have an update rule for a fixed set of model parameters based on training data. Another example of a non-parametric algorithm is k-Nearest Neighbors. So all in all, here (in the most basic case of GPs) we have no gradients and no objective function that we optimize over directly.

Second of all, GPs lend themselves nicely to the Bayesian perspective in machine learning (I advise that when you see “Bayesian”, you think of quantifying uncertainty in a prediction, or the amount of information).

So before we jump into GPs, we need to cover these nice properties of Gaussian distributions. As you will see later, GPs don’t bring much more to the table beyond using well-known results for multivariate Gaussian distributions. For simplicity, let’s just look at the bivariate case of the Gaussian, which is defined by a mean vector μ and covariance matrix Σ:

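In standard form, the density of such a bivariate Gaussian with mean vector μ and covariance matrix Σ is:

\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{2\pi \sqrt{|\Sigma|}} \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right), \qquad \mathbf{x}, \boldsymbol{\mu} \in \mathbb{R}^2, \; \Sigma \in \mathbb{R}^{2 \times 2}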

Just to get a feeling for it, the plot below shows the density of the bivariate Gaussian distribution centered at 0 (zero mean). Try to figure out what the covariance matrix looks like for the different Gaussians.

[Figure: densities of zero-mean bivariate Gaussians with different covariance matrices]
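
As a quick numerical sketch (the covariance values below are made up for illustration), one can compare two such densities directly:

```python
# Evaluate two zero-mean bivariate Gaussians with different (hypothetical)
# covariance matrices to see how Sigma shapes the density.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(2)
sigma_isotropic = np.eye(2)                       # circular contours
sigma_correlated = np.array([[1.0, 0.8],
                             [0.8, 1.0]])         # tilted, elongated contours

point = np.array([1.0, 1.0])
print(multivariate_normal(mu, sigma_isotropic).pdf(point))
print(multivariate_normal(mu, sigma_correlated).pdf(point))  # higher along the x1 = x2 diagonal
```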

The shuffling property. We can take the mean vector and covariance matrix and shuffle them consistently, so that the respective covariances and means still belong together. The new distribution is the same as the old one; it doesn’t change.

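Stated a bit more precisely (a standard result, written here with a permutation matrix P): if x ~ N(μ, Σ), then

P\mathbf{x} \sim \mathcal{N}\!\left(P\boldsymbol{\mu},\; P \Sigma P^\top\right)

so reordering the variables simply reorders the entries of μ and the rows and columns of Σ accordingly; the joint distribution itself is unchanged.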

The marginal property. The multivariate Gaussian distribution defines a joint distribution over a set of variables. Each of these variables is also distributed according to a Gaussian distribution; in fact, we can read the parameters of that distribution off the mean vector and the covariance matrix. As an example, we can take the 1st dimension of the mean vector, μ₁, and the corresponding entry of the covariance matrix, Σ₁₁. The marginal Gaussian is then N(μ₁, Σ₁₁).

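In the bivariate case from above, this reads:

\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right) \;\;\Rightarrow\;\; x_1 \sim \mathcal{N}(\mu_1, \Sigma_{11})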

The conditioning property. As we stated earlier, the multivariate Gaussian defines a joint distribution over variables. As a matter of fact, conditioning on one of those variables also yields a Gaussian distribution. Furthermore, calculating the mean and covariance of the resulting distribution is simple and possible in closed form. This is an essential property that we will make use of in GPs.

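For the same bivariate partitioning, the standard closed-form result is:

p(x_1 \mid x_2 = a) = \mathcal{N}\!\left( \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (a - \mu_2),\;\; \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right)

This is exactly the formula that will reappear later when we condition the GP on training data.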

From Multivariate Gaussians to Inference

Now that we’ve covered the properties of Gaussians, we can talk about how GPs work. First of all, let us remember that we have a general regression task: a training dataset of (x, y) pairs. As I have stated earlier:

A Gaussian Process is completely specified by the mean function μ and the kernel function k

The following equation just says that the function f is distributed according to the GP:

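In the usual notation, with mean function μ and kernel function k, this is written as:

f(x) \sim \mathcal{GP}\!\left(\mu(x),\; k(x, x')\right)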

The kernel function is going to be a measure of similarity between data points; or better said, k(x₁, x₂) tells us how close y₁ and y₂ are expected to be. In essence, by specifying the mean function and the kernel function we are specifying the prior distribution over an infinite number of functions. We can think of this GP distribution as a placeholder for a multivariate normal, because in the end we will just be dealing with multivariate normal distributions, but the way that we calculate the mean and the covariance of this multivariate normal distribution is the beauty of Gaussian Processes. This may become clearer when we expand the previous equation a bit:

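For any finite set of inputs X = (x₁, …, xₙ), this expands (in the usual notation) into a plain multivariate normal:

f(X) = \bigl(f(x_1), \ldots, f(x_n)\bigr) \sim \mathcal{N}\!\bigl(\mu(X),\; K\bigr), \qquad \mu(X)_i = \mu(x_i), \quad K_{ij} = k(x_i, x_j)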

We can see that in order to calculate the mean vector and covariance matrix of this multivariate normal distribution, we just need to apply the mean function to each point in the training set and the kernel function to all pairs of points in the training set, which yields the covariance matrix. It is worth noting here that we obviously cannot use just any function as a kernel function: the covariance matrix has to be positive definite, so the kernel function needs to be chosen accordingly.

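A minimal sketch of this step in code (the RBF kernel and the training inputs below are illustrative assumptions, not something the article fixes):

```python
# Build the mean vector and covariance matrix of the multivariate normal
# by applying the mean function and the kernel function to the training inputs.
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # k(x1, x2) = sigma^2 * exp(-(x1 - x2)^2 / (2 * l^2)), a common positive-definite choice
    return variance * np.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)

X_train = np.array([-2.0, -1.0, 0.5, 2.0])            # hypothetical training inputs
mean_vec = np.zeros(len(X_train))                      # zero mean function, as assumed later in the text
K = np.array([[rbf_kernel(a, b) for b in X_train]      # kernel applied to all pairs of points
              for a in X_train])
print(K.shape)                                         # (4, 4), symmetric and positive definite
```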

I would like to turn your attention again to the phrase “distribution over functions”. Why do we see it like this? In order to make this clearer, look at the figure below. The x axis shows the input variable and the y axis the function value. In blue, we see the different realizations of the function for a specific value of x. A red line traces one sampled realization of the function over x. How would we perform such a sampling? It is pretty clear from the definition of the GP: we can just sample a certain set of values Y, that is f(X), from the joint distribution over the variables. We said that the marginal distribution of a multivariate Gaussian is also Gaussian; this is the reason that we get multiple possible values f(x) for the same x, because each marginal f(x) is itself normally distributed.

[Figure: function values sampled from the GP prior at a few inputs x; a red line traces one sampled realization of the function]

This is how it would look if we had more data, more xs; again, it is just sampling from a joint distribution defined by the multivariate Gaussian.

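A sketch of such a sampling in code, reusing rbf_kernel from the previous snippet (the grid of xs is just an illustrative choice):

```python
# Sampling "functions" from the GP prior = sampling from N(0, K) evaluated on a grid of xs.
import numpy as np

X_grid = np.linspace(-3.0, 3.0, 50)
K_prior = np.array([[rbf_kernel(a, b) for b in X_grid] for a in X_grid])

rng = np.random.default_rng(0)
jitter = 1e-8 * np.eye(len(X_grid))                    # keeps the covariance numerically positive definite
samples = rng.multivariate_normal(np.zeros(len(X_grid)), K_prior + jitter, size=5)
print(samples.shape)                                   # (5, 50): five sampled functions on the grid
```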

But of course, we didn’t do any proper inference yet, did we? We’ve just seen that we can sample from an infinite number of functions, but we actually want to predict the value of the function at an unknown point x. This is exactly what we achieve by conditioning. Let us assume that our training dataset contains the variables x₂ and x₄, and we want to predict the function values at x₁ and x₃. The logical thing to do is to condition on the known values. By conditioning, we reduce the uncertainty (the possible values of the function f) at points that are close. What “close” actually means is defined by the kernel function.

[Figure: functions sampled from the GP after conditioning on the training points x₂ and x₄; uncertainty shrinks near them]

Or shown with a few more functions (by more, I don’t mean a fixed number of functions; we just yank N functions from the infinite set of functions):

[Figure: the same conditioning, shown with more sampled functions]

You might be wondering: “But wait, we are conditioning on some kind of values f(x), but f(x) doesn’t even appear in the definition of the distribution”. I admit, when talking about GPs, this might be confusing. But in fact, f(x), that is y, does appear. It appears in the closed-form solutions for the mean and covariance of the conditional distribution at the prediction points. The equations look like the following (in the case that our mean function μ is 0):

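Writing X for the training inputs, y for their function values, X_* for the prediction points, and K = k(X, X) for the design matrix, the standard closed form is:

\mu = k(X_*, X)\, K^{-1} \mathbf{y}

\nu = k(X_*, X_*) - k(X_*, X)\, K^{-1}\, k(X, X_*)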
The small k is the kernel function, not to be confused with the big K, which is the “design matrix” that we obtain by applying the kernel function to our training set X. μ is the new mean and ν the covariance of the conditional distribution.

As you can see, y is essential to calculating the new mean, which is basically the point that we are most confident about in our prediction. This is the case where we do not assume noisy data. But what if our data is in fact noisy? Such cases are pretty standard when working with robotics applications; we often assume that the noise is normally distributed with mean 0. The change to the conditional equations is minimal: as you can see, we just add a diagonal variance term to the design matrix. Intuitively, what does this mean? It means that we are increasing the uncertainty of our marginals, which are exactly our training data points, and this is what we want to do in the presence of noise.

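With i.i.d. Gaussian observation noise of variance σₙ² added to the training targets, the same equations become:

\mu = k(X_*, X)\, \left(K + \sigma_n^2 I\right)^{-1} \mathbf{y}

\nu = k(X_*, X_*) - k(X_*, X)\, \left(K + \sigma_n^2 I\right)^{-1} k(X, X_*)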
By adding additional uncertainty to the covariance matrix, we account for noisy training data / observations.

Now we can plot the conditional distribution over functions after we did this small bit of linear algebra:

[Figure: the conditional (posterior) distribution over functions after conditioning on the training data]
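
A minimal sketch of that bit of linear algebra, reusing rbf_kernel and X_train from the earlier snippets (the targets y_train and the noise level are made-up values):

```python
# Condition the GP on (X_train, y_train) and evaluate the posterior on a test grid.
import numpy as np

y_train = np.array([0.5, -0.3, 0.8, 0.1])              # hypothetical training targets
X_test = np.linspace(-3.0, 3.0, 50)
sigma_n = 0.1                                          # assumed observation-noise standard deviation

def kern(A, B):
    return np.array([[rbf_kernel(a, b) for b in B] for a in A])

K = kern(X_train, X_train) + sigma_n**2 * np.eye(len(X_train))
K_s = kern(X_train, X_test)                            # k(X, X_*)
K_ss = kern(X_test, X_test)                            # k(X_*, X_*)

K_inv = np.linalg.inv(K)                               # fine for a toy example; see the Cholesky note below
post_mean = K_s.T @ K_inv @ y_train                    # mu = k(X_*, X) (K + sigma_n^2 I)^-1 y
post_cov = K_ss - K_s.T @ K_inv @ K_s                  # nu = k(X_*, X_*) - k(X_*, X) (...)^-1 k(X, X_*)
print(post_mean.shape, post_cov.shape)                 # (50,), (50, 50)
```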

Additional Considerations and Fun Facts

Here are perhaps some additional things that are useful to think about.

Calculating the inverse of the covariance. Calculating the inverse of the covariance matrix is known to be numerically unstable, especially in the case of large datasets. Note that the computational complexity of inference scales with the dataset size in Gaussian Processes, which is not the case in parametric models. So the method you choose to compute the inverse is essential for performance; one of the most used methods incorporates the Cholesky decomposition.

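A sketch of how that typically looks, reusing K, K_s and y_train from the previous snippet; instead of forming the inverse explicitly, we factor K once and solve linear systems against the factor:

```python
# Cholesky-based GP solve: more stable and cheaper than an explicit matrix inverse.
from scipy.linalg import cho_factor, cho_solve

c_and_lower = cho_factor(K)                            # K = L L^T
alpha = cho_solve(c_and_lower, y_train)                # solves K alpha = y without forming K^-1
post_mean = K_s.T @ alpha                              # same posterior mean as before
```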

What is the computational complexity of inference? The hard truth is that increasing the training data size increases the computational complexity of inference (naively, cubically in the number of training points, because of the inversion of K). To alleviate this, some people work on sparse Gaussian Processes, which basically choose in a smart way which part of the matrix K to condition on. Intuitively, we don’t need to use all of the training points to make a prediction; we only need to use the ones that are “close”. By reducing the size of the covariance matrix used in the conditioning, we reduce the computational complexity.

Should I use Gaussian Processes or Deep Neural Networks? This is a strongly debated question. Of course, pointing out the rising computational complexity of inference with dataset size could be an argument for neural networks, but there is work showing that Gaussian Processes greatly outperform neural nets on a multitude of tasks. There is also theoretical work showing that a very “large” neural net is basically a GP.

Do Gaussian Processes have the problem of overfitting? Not necessarily. The reason is that we have a measure of uncertainty in our predictions, and it highly depends on the choice of the kernel function for the design matrix.

What’s up with these kernels anyway? I wrote a blog post on the topic that (might) be informative. In short, kernels are a measure of similarity between two things, or a distance metric if you will. It is a bit of an unfortunate name, since the term “kernel” is used in different contexts in ML, such as “convolutional kernel”, which has nothing to do with kernel functions or the kernel trick.

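As one common example (the squared-exponential / RBF kernel used in the sketches above), similarity decays smoothly with distance, controlled by a lengthscale ℓ and a variance σ²:

k(x, x') = \sigma^2 \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2 \ell^2} \right)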

Where I have data I am certain, but where I don’t I am uncertain; isn’t that just stupid? No, it’s not just “where you have data”: the amount of “certainty” highly depends on the choice of the kernel function and on what it means to be “close”. But yeah, it is as stupid as the kernel function chosen for the problem at hand.

Some Useful Links

Gaussian Process Summer School

Gaussian Process Playground

Gaussian Process Book

My blog post about kernels

Thanks!

Translated from: https://towardsdatascience.com/no-stress-gaussian-processes-40e238597864
