The Fundamental Problems of Probabilistic Inference


Probabilistic Inference

By now, I know many people who do research in ML or play around with machine learning algorithms. Yet, most of them somehow don't appreciate the fundamental problems that machine learning is built upon: the problems of probabilistic inference. The point of this article is to turn your attention to questions that you might not have considered when coding machine learning algorithms.


Why do we talk about probability in the first place? Where does the randomness come from? Is there really such a thing as a random variable? In the end, we want to predict something relatively concrete: a class label given an image, or an optimal action given some kind of state description in a Markov Decision Process. Arguably, there is nothing random about these things. An object is not assigned a class label with some probability, in a truly random sense. A cow is not maybe a cow, it is certainly a cow.


Photo by Jean Carlo Emer on Unsplash

On the other hand, we have problems such as the different flavors of unsupervised learning, where we might want to reduce the dimensionality of the data, cluster the data, or learn a generative model that reflects the probability distribution of the data. All of these flavors can be expressed in terms of probability. But again, assigning a latent low-dimensional representation to a data point is not truly random. We want to map an input data point directly to a latent representation (a cluster, a latent variable of lower dimension).


So, why do we use randomness and probabilities in the end? The main argument is that there is no randomness. What we actually talk about when we talk about probability distributions and densities in the machine learning world is information, or uncertainty. A probability measure reflects how much information we have about a given event. Let's look at a supervised example, an image containing a cow. When I say the probability of the image containing a cow is 0.9, this doesn't mean that the cow is only sometimes there; it means that I am 90% certain that a cow is in this image. Perhaps the image doesn't have enough information to fully determine that it contains a cow, or perhaps my model is wrong… Who knows.


Where does the randomness come in? The punchline is that there is no randomness. Let's talk about prior distributions for a moment. What does it mean that I have a prior distribution over my model parameters, or in other words over the hypothesis space? It means that I am more certain about particular configurations, particular hypotheses, than about others. It doesn't mean that the model is random, or that it can change based on an underlying stochastic process.

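To make this concrete, consider one common choice (an illustrative example, not one from the article): a Gaussian prior over the weights of a model, $p(\theta) = \mathcal{N}(\theta;\, 0,\, \sigma^2 I)$. It simply states that I am more certain about configurations with small weights than about configurations with large ones; once a particular θ is fixed, the model itself is entirely deterministic.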

Our goal when we do machine learning is mostly (in different shapes and forms) the following, stated in the most general sense: given some kind of data, we want to infer the parameters of the model that best describes that data. If you want to be Bayesian, then you will adhere to these concepts of uncertainty. The following is Bayes' rule for inferring the model parameters θ given data D.


$$p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)}$$

We name p(θ) the prior, p(D|θ) the likelihood, and the LHS (left-hand side) the posterior. The posterior simply expresses how certain we are about the parameters. In the denominator is the marginal over D. Note that we don't magically have access to this marginal; we need to somehow evaluate it. To come to an integral formulation of the denominator, we can marginalize out θ from the joint distribution of θ and D. In that case, we have the following formulation of Bayes' rule:


$$p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{\int p(D, \theta')\,d\theta'}$$

Equivalently, we can separate the term inside the integral in the denominator so that it contains the likelihood and the prior over θ.

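Written out, using θ′ for the integration variable, this reads:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{\int p(D \mid \theta')\,p(\theta')\,d\theta'}$$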

Why is this a hard problem? ML practitioners say that evaluating this integral is hard, but it may sometimes be tricky to understand why it is hard. First of all, the θ in the denominator (written θ′ above) is not the same one as in the numerator; it is the variable being integrated over. The integral basically means that we sum the joint distribution over all possible parameters θ. Now, suppose that θ are the parameters of a neural network, and that they are real numbers. There exist infinitely many configurations of parameters to evaluate in the denominator. This is clearly intractable. Even if we consider the real numbers to be limited to 32-bit precision, the number of possible configurations is exponential in the number of bits used for storing the model. Furthermore, the process of evaluating the integral involves sampling; clearly, the better our sampling, the better we can expect our estimate to be.

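To make the intractability concrete, here is a minimal sketch, assuming a toy one-parameter Gaussian model (my own example, not the article's), where the denominator can still be evaluated by brute force over a grid. The same strategy explodes combinatorially as soon as θ has more than a handful of dimensions.

```python
import numpy as np

# Toy one-parameter model (an assumption for illustration):
# data ~ N(theta, 1) with a N(0, 1) prior over theta.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.7, scale=1.0, size=20)   # observed data D
grid = np.linspace(-3.0, 3.0, 1_001)             # every "configuration" of theta we consider

log_prior = -0.5 * grid**2                                            # log N(0, 1), up to a constant
log_lik = np.array([-0.5 * np.sum((data - t) ** 2) for t in grid])    # log p(D | theta), up to a constant

# The denominator of Bayes' rule becomes an explicit sum over every grid point.
log_joint = log_prior + log_lik
joint = np.exp(log_joint - log_joint.max())      # shift by the max for numerical stability
posterior = joint / joint.sum()                  # normalize: the grid sum plays the role of the denominator

print("posterior mean of theta:", np.sum(grid * posterior))

# With d parameters and k grid points each, the sum has k**d terms:
# for k = 1_001 and d = 10 that is already ~1e30 evaluations, and a neural
# network has millions of real-valued parameters.
```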

Photo by TAHA AJMI on Unsplash

As you read on, it is useful to keep the following in mind:


Computational complexity is important both for estimation and for sampling.


Nevertheless, the situation is not hopeless. Although some integrals cannot be evaluated exactly, we can still estimate them. There are different ways of estimating an integral, the most basic of which is Monte Carlo sampling (notice the "sampling" part of the name). What we can do is draw a certain number of parameters θ and evaluate the function inside the integral (note that we assume we have access to that function), which looks like this:


$$p(D) = \int p(D \mid \theta')\,p(\theta')\,d\theta' \;\approx\; \frac{1}{N}\sum_{i=1}^{N} p(D \mid \theta_i), \qquad \theta_i \sim p(\theta)$$

This is a relatively simple estimator, and it is unbiased, meaning that its expected value is the true value of the integral and that it yields the exact result in the limit of an infinite number of samples. However, it does come with a drawback.


It can be shown that for the Monte Carlo estimator the error of the estimate drops asymptotically as the inverse square root of the number of samples.

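More precisely, for an i.i.d. Monte Carlo average of a function f of θ with finite variance, the standard error behaves as

$$\operatorname{err}_N \approx \frac{\sigma_f}{\sqrt{N}}, \qquad \sigma_f^2 = \operatorname{Var}_{\theta \sim p(\theta)}\big[f(\theta)\big],$$

so halving the error requires roughly four times as many samples.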

This means that if we want a good ballpark estimate of the target value, it might be good enough. But if we want a high-precision estimate, there may be better options with faster convergence properties. Additionally, we might not want to sample from the distribution of the parameters that often, since sampling may be expensive.

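As a small illustration, here is a sketch of the plain Monte Carlo estimator at work. It assumes a conjugate Gaussian toy model (my own choice, picked only because its marginal likelihood has a closed form to compare against); the estimate is unbiased at every N, and its error shrinks roughly at the 1/√N rate discussed above.

```python
import numpy as np

# Toy conjugate model (an assumption for illustration): one observation x with
# Gaussian likelihood N(x; theta, sigma^2) and Gaussian prior N(theta; mu0, tau^2).
rng = np.random.default_rng(0)
x, sigma = 1.3, 0.5          # observed data point and likelihood std
mu0, tau = 0.0, 1.0          # prior mean and std over theta

def likelihood(x, theta, sigma):
    """p(x | theta): Gaussian density of x centered at theta."""
    return np.exp(-0.5 * ((x - theta) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Exact marginal p(x) = N(x; mu0, sigma^2 + tau^2); available only because the
# toy model is conjugate, in general we would have no such reference value.
s = np.sqrt(sigma**2 + tau**2)
exact = np.exp(-0.5 * ((x - mu0) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Monte Carlo estimate: sample theta from the prior, average the likelihood.
for n in [10, 100, 1_000, 10_000, 100_000]:
    thetas = rng.normal(mu0, tau, size=n)
    estimate = likelihood(x, thetas, sigma).mean()
    print(f"N={n:>7}  estimate={estimate:.5f}  abs. error={abs(estimate - exact):.5f}")
```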

Computational Considerations of Sampling from Distributions

Till now, we have talked about the problem of evaluating the denominator in Bayes' rule to calculate the posterior p(θ|D). In the Monte Carlo estimate, we need to draw samples of θ in order to estimate the integral. But what does the distribution over θ actually look like? Have you ever wondered how sampling actually works, how we arrive at a "random" number when using a Python or R library? Even if I give you a sampler capable of sampling from a uniform distribution, how would you use it in order to sample from more complicated distributions? As it turns out, the problem of sampling is a fascinating problem in itself, both computationally and mathematically.

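As a teaser, here is a minimal sketch of one classical answer, inverse-transform sampling: if the target CDF F can be inverted, pushing a uniform sample through F⁻¹ yields a sample from the target. The exponential distribution below is just a convenient example whose CDF inverts in closed form.

```python
import numpy as np

# Inverse-transform sampling (one classical technique, chosen here as an example):
# if U ~ Uniform(0, 1) and F is the target CDF, then F^{-1}(U) has the target law.
rng = np.random.default_rng(0)

def sample_exponential(rate, size):
    """Draw Exponential(rate) samples using only uniform random numbers.

    CDF: F(x) = 1 - exp(-rate * x)  =>  F^{-1}(u) = -log(1 - u) / rate
    """
    u = rng.uniform(0.0, 1.0, size=size)
    return -np.log1p(-u) / rate        # log1p(-u) = log(1 - u), numerically stable

samples = sample_exponential(rate=2.0, size=100_000)
print("empirical mean:", samples.mean())   # should be close to 1 / rate = 0.5
print("empirical std :", samples.std())    # for an exponential, also close to 1 / rate
```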

The main question that we want to ask ourselves when sampling is the following: if we are able to sample efficiently from a simpler probability distribution, how can we sample from a more complicated one? As it turns out, sampling from complicated distributions is a non-trivial problem, and it forms the basis of many approaches in ML, to name a few: offline reinforcement learning, normalizing flows, variational inference… But this discussion I leave for a later article.


Related Articles

What is the "Information" in Information Theory: to understand more about how uncertainty and the concept of information relate.


Translated from: https://towardsdatascience.com/fundamental-problems-of-probabilistic-inference-b46be1f96127
