03-Probabilities, Gaussians, and Bayes' Theorem (Part 1)

Other articles in this series:

02-Discrete Bayes Filter (hyisoe's blog on CSDN)


This article is translated from ISOEhy/Kalman-and-Bayesian-Filters-in-Python (github.com). The translation was produced with Google machine translation; I corrected some obviously wrong renderings, but a number of awkward passages remain. Fortunately the English original is preserved, and reading the two together can help readers whose English is weaker to understand the material better. Where the translation is imprecise, consult the English original.

This series builds from simple to deep and is an excellent resource for learning the Kalman filter. If your English is good you can read the original directly; it is written to be very easy to understand. I will finish translating the rest of the series as time permits.

%matplotlib inline
#format the book
import book_format
book_format.set_style()

Introduction

The last chapter ended by discussing some of the drawbacks of the Discrete Bayesian filter. For many tracking and filtering problems our desire is to have a filter that is unimodal and continuous. That is, we want to model our system using floating point math (continuous) and to have only one belief represented (unimodal). For example, we want to say an aircraft is at (12.34, -95.54, 2389.5) where that is latitude, longitude, and altitude. We do not want our filter to tell us "it might be at (1.65, -78.01, 2100.45) or it might be at (34.36, -98.23, 2543.79)." That doesn't match our physical intuition of how the world works, and as we discussed, it can be prohibitively expensive to compute the multimodal case. And, of course, multiple position estimates make navigating impossible.

We desire a unimodal, continuous way to represent probabilities that models how the real world works, and that is computationally efficient to calculate. Gaussian distributions provide all of these features.


Mean, Variance, and Standard Deviations

Most of you will have had exposure to statistics, but allow me to cover this material anyway. I ask that you read the material even if you are sure you know it well. I ask for two reasons. First, I want to be sure that we are using terms in the same way. Second, I strive to form an intuitive understanding of statistics that will serve you well in later chapters. It’s easy to go through a stats course and only remember the formulas and calculations, and perhaps be fuzzy on the implications of what you have learned.


Random Variables

Each time you roll a die the outcome will be between 1 and 6. If we rolled a fair die a million times we’d expect to get a one 1/6 of the time. Thus we say the probability, or odds of the outcome 1 is 1/6. Likewise, if I asked you the chance of 1 being the result of the next roll you’d reply 1/6.

This combination of values and associated probabilities is called a random variable. Here random does not mean the process is nondeterministic, only that we lack information about the outcome. The result of a die toss is deterministic, but we lack enough information to compute the result. We don’t know what will happen, except probabilistically.

While we are defining terms, the range of values is called the sample space. For a die the sample space is {1, 2, 3, 4, 5, 6}. For a coin the sample space is {H, T}. Space is a mathematical term which means a set with structure. The sample space for the die is a subset of the natural numbers in the range of 1 to 6.

Another example of a random variable is the heights of students in a university. Here the sample space is a range of values in the real numbers between two limits defined by biology.

Random variables such as coin tosses and die rolls are discrete random variables. This means their sample space is represented by either a finite number of values or a countably infinite number of values such as the natural numbers. Heights of humans are called continuous random variables since they can take on any real value between two limits.

Do not confuse the measurement of the random variable with the actual value. If we can only measure the height of a person to 0.1 meters we would only record values from 0.1, 0.2, 0.3…2.7, yielding 27 discrete choices. Nonetheless a person’s height can vary between any arbitrary real value between those ranges, and so height is a continuous random variable.

In statistics capital letters are used for random variables, usually from the latter half of the alphabet. So, we might say that $X$ is the random variable representing the die toss, or $Y$ are the heights of the students in the freshmen poetry class. Later chapters use linear algebra to solve these problems, and so there we will follow the convention of using lower case for vectors, and upper case for matrices. Unfortunately these conventions clash, and you will have to determine which an author is using from context. I always use bold symbols for vectors and matrices, which helps distinguish between the two.


Probability Distribution

The probability distribution gives the probability for the random variable to take any value in a sample space. For example, for a fair six sided die we might say:


| Value | Probability |
|-------|-------------|
| 1     | 1/6         |
| 2     | 1/6         |
| 3     | 1/6         |
| 4     | 1/6         |
| 5     | 1/6         |
| 6     | 1/6         |

We denote this distribution with a lower case p: $p(x)$. Using ordinary function notation, we would write:

$$P(X{=}4) = p(4) = \frac{1}{6}$$

This states that the probability of the die landing on 4 is $\frac{1}{6}$. $P(X{=}x_k)$ is notation for "the probability of $X$ being $x_k$". Note the subtle notational difference. The capital $P$ denotes the probability of a single event, and the lower case $p$ is the probability distribution function. This can lead you astray if you are not observant. Some texts use $Pr$ instead of $P$ to ameliorate this.

Another example is a fair coin. It has the sample space {H, T}. The coin is fair, so the probability for heads (H) is 50%, and the probability for tails (T) is 50%. We write this as

$$\begin{gathered}P(X{=}H) = 0.5\\P(X{=}T)=0.5\end{gathered}$$

Sample spaces are not unique. One sample space for a die is {1, 2, 3, 4, 5, 6}. Another valid sample space would be {even, odd}. Another might be {dots in all corners, not dots in all corners}. A sample space is valid so long as it covers all possibilities, and any single event is described by only one element. {even, 1, 3, 4, 5} is not a valid sample space for a die since a value of 4 is matched both by ‘even’ and ‘4’.

The probabilities for all values of a discrete random variable are known as the discrete probability distribution, and the probabilities for all values of a continuous random variable are known as the continuous probability distribution.

To be a probability distribution the probability of each value $x_i$ must satisfy $p(x_i) \ge 0$, since no probability can be less than zero. Secondly, the sum of the probabilities for all values must equal one. This should be intuitively clear for a coin toss: if the odds of getting heads is 70%, then the odds of getting tails must be 30%. We formalize this requirement as

$$\sum\limits_u P(X{=}u) = 1$$

for discrete distributions, and as

$$\int\limits_u P(X{=}u) \,du = 1$$

for continuous distributions.
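
Both requirements are easy to check in code. As a small illustration (this helper is mine, not part of any library), we can test whether an array of numbers forms a valid discrete probability distribution:

import numpy as np

def is_valid_distribution(p, tol=1e-9):
    # all probabilities must be non-negative and they must sum to one
    p = np.asarray(p, dtype=float)
    return bool((p >= 0).all() and abs(p.sum() - 1.0) < tol)

print(is_valid_distribution([1/6]*6))     # fair die -> True
print(is_valid_distribution([0.7, 0.3]))  # biased coin -> True
print(is_valid_distribution([0.7, 0.4]))  # sums to 1.1 -> False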

In the previous chapter we used probability distributions to estimate the position of a dog in a hallway. For example:


import numpy as np
import kf_book.book_plots as book_plots

belief = np.array([1, 4, 2, 0, 8, 2, 2, 35, 4, 3, 2])
belief = belief / np.sum(belief)
with book_plots.figsize(y=2):
    book_plots.bar_plot(belief)
print('sum = ', np.sum(belief))
sum =  1.0


Each position has a probability between 0 and 1, and the sum of all equals one, so this makes it a probability distribution. Each probability is discrete, so we can more precisely call this a discrete probability distribution. In practice we leave out the terms discrete and continuous unless we have a particular reason to make that distinction.

The Mean, Median, and Mode of a Random Variable

Given a set of data we often want to know a representative or average value for that set. There are many measures for this, and the concept is called a measure of central tendency. For example we might want to know the average height of the students in a class. We all know how to find the average of a set of data, but let me belabor the point so I can introduce more formal notation and terminology. Another word for average is the mean. We compute the mean by summing the values and dividing by the number of values. If the heights of the students in meters are

$$X = \{1.8, 2.0, 1.7, 1.9, 1.6\}$$

we compute the mean as

$$\mu = \frac{1.8 + 2.0 + 1.7 + 1.9 + 1.6}{5} = 1.8$$

It is traditional to use the symbol $\mu$ (mu) to denote the mean.

We can formalize this computation with the equation

$$\mu = \frac{1}{n}\sum^n_{i=1} x_i$$

NumPy provides numpy.mean() for computing the mean.


x = [1.8, 2.0, 1.7, 1.9, 1.6]
np.mean(x)
1.8

As a convenience NumPy arrays provide the method mean().

x = np.array([1.8, 2.0, 1.7, 1.9, 1.6])
x.mean()
1.8

The mode of a set of numbers is the number that occurs most often. If only one number occurs most often we say it is a unimodal set, and if two or more numbers occur the most with equal frequency then the set is multimodal. For example the set {1, 2, 2, 2, 3, 4, 4, 4} has modes 2 and 4, which is multimodal, and the set {5, 7, 7, 13} has the mode 7, and so it is unimodal. We will not be computing the mode in this manner in this book, but we do use the concepts of unimodal and multimodal in a more general sense. For example, in the Discrete Bayes chapter we talked about our belief in the dog's position as a multimodal distribution because we assigned different probabilities to different positions.

Finally, the median of a set of numbers is the middle point of the set so that half the values are below the median and half are above the median. Here, above and below is in relation to the set being sorted. If the set contains an even number of values then the two middle numbers are averaged together.

Numpy provides numpy.median() to compute the median. As you can see the median of {1.8, 2.0, 1.7, 1.9, 1.6} is 1.8, because 1.8 is the third element of this set after being sorted. In this case the median equals the mean, but that is not generally true.

np.median(x)
1.8
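
If the set has an even number of values np.median averages the two middle values, which we can confirm with a quick example (heights chosen arbitrarily for illustration):

print(np.median([1.6, 1.7, 1.9, 2.0]))  # average of 1.7 and 1.9
1.8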

Expected Value of a Random Variable

The expected value of a random variable is the average value it would have if we took an infinite number of samples of it and then averaged those samples together. Let's say we have $x=[1,3,5]$ and each value is equally probable. What value would we expect $x$ to have, on average?

It would be the average of 1, 3, and 5, of course, which is 3. That should make sense; we would expect equal numbers of 1, 3, and 5 to occur, so $(1+3+5)/3=3$ is clearly the average of that infinite series of samples. In other words, here the expected value is the mean of the sample space.

Now suppose that each value has a different probability of happening. Say 1 has an 80% chance of occurring, 3 has a 15% chance, and 5 has only a 5% chance. In this case we compute the expected value by multiplying each value of $x$ by the percent chance of it occurring, and summing the result. For this case we could compute

$$\mathbb E[X] = (1)(0.8) + (3)(0.15) + (5)(0.05) = 1.5$$

Here I have introduced the notation $\mathbb E[X]$ for the expected value of $x$. Some texts use $E(x)$. The value 1.5 for $x$ makes intuitive sense because $x$ is far more likely to be 1 than 3 or 5, and 3 is more likely than 5 as well.

We can formalize this by letting $x_i$ be the $i^{th}$ value of $X$, and $p_i$ be the probability of its occurrence. This gives us

$$\mathbb E[X] = \sum_{i=1}^n p_ix_i$$

A trivial bit of algebra shows that if the probabilities are all equal, the expected value is the same as the mean:

$$\mathbb E[X] = \sum_{i=1}^n p_ix_i = \frac{1}{n}\sum_{i=1}^n x_i = \mu_x$$

If $x$ is continuous we substitute the sum for an integral, like so

$$\mathbb E[X] = \int_{a}^b\, xf(x) \,dx$$

where $f(x)$ is the probability distribution function of $x$. We won't be using this equation yet, but we will be using it in the next chapter.
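
We can also evaluate the discrete formula directly. np.dot multiplies each value by its probability and sums the products, so the expected value of the example above is a one-liner:

x = np.array([1, 3, 5])
p = np.array([0.80, 0.15, 0.05])
print(np.dot(p, x))  # 1.5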

We can write a bit of Python to simulate this. Here I take 1,000,000 samples and compute the expected value of the distribution we just computed analytically.


total = 0
N = 1000000
for r in np.random.rand(N):
    if r <= .80: total += 1
    elif r < .95: total += 3
    else: total += 5

total / N
1.499908

You can see that the computed value is close to the analytically derived value. It is not exact because getting an exact value requires an infinite sample size.

Exercise

What is the expected value of a die roll?

Solution

Each side is equally likely, so each has a probability of 1/6. Hence
$$\begin{aligned} \mathbb E[X] &= 1/6\times1 + 1/6\times 2 + 1/6\times 3 + 1/6\times 4 + 1/6\times 5 + 1/6\times 6 \\ &= 1/6(1+2+3+4+5+6)\\ &= 3.5\end{aligned}$$
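
Since every outcome is equally likely the expected value is just the mean of the sample space, which we can verify numerically:

print(np.mean([1, 2, 3, 4, 5, 6]))
3.5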

Exercise

Given the uniform continuous distribution

$$f(x) = \frac{1}{b - a}$$

compute the expected value for $a=0$ and $b=20$.

Solution

$$\begin{aligned} \mathbb E[X] &= \int_0^{20}\, x\frac{1}{20} \,dx \\ &= \bigg[\frac{x^2}{40}\bigg]_0^{20} \\ &= 10 - 0 \\ &= 10 \end{aligned}$$
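
We can sanity-check this with a Monte Carlo draw from the uniform distribution; the sample mean approaches 10 as the sample grows (the exact output varies from run to run):

samples = np.random.uniform(0, 20, 1000000)
print(samples.mean())  # close to 10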

Variance of a Random Variable

The computation above tells us the average height of the students, but it doesn't tell us everything we might want to know. For example, suppose we have three classes of students, which we label $X$, $Y$, and $Z$, with these heights:


X = [1.8, 2.0, 1.7, 1.9, 1.6]
Y = [2.2, 1.5, 2.3, 1.7, 1.3]
Z = [1.8, 1.8, 1.8, 1.8, 1.8]

Using NumPy we see that the mean height of each class is the same.

print(np.mean(X), np.mean(Y), np.mean(Z))
1.8 1.8 1.8

The mean of each class is 1.8 meters, but notice that there is a much greater amount of variation in the heights in the second class than in the first class, and that there is no variation at all in the third class.

The mean tells us something about the data, but not the whole story. We want to be able to specify how much variation there is between the heights of the students. You can imagine a number of reasons for this. Perhaps a school district needs to order 5,000 desks, and they want to be sure they buy sizes that accommodate the range of heights of the students.

Statistics has formalized this concept of measuring variation into the notion of standard deviation and variance. The equation for computing the variance is

$$\mathit{VAR}(X) = \mathbb E[(X - \mu)^2]$$

Ignoring the square for a moment, you can see that the variance is the expected value for how much the sample space $X$ varies from the mean $\mu$: $(X-\mu)$. I will explain the purpose of the squared term later. The formula for the expected value is $\mathbb E[X] = \sum\limits_{i=1}^n p_ix_i$ so we can substitute that into the equation above to get

$$\mathit{VAR}(X) = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2$$

Let’s compute the variance of the three classes to see what values we get and to become familiar with this concept.

The mean of $X$ is 1.8 ($\mu_x = 1.8$) so we compute

$$\begin{aligned} \mathit{VAR}(X) &=\frac{(1.8-1.8)^2 + (2-1.8)^2 + (1.7-1.8)^2 + (1.9-1.8)^2 + (1.6-1.8)^2}{5} \\ &= \frac{0 + 0.04 + 0.01 + 0.01 + 0.04}{5} \\ \mathit{VAR}(X) &= 0.02 \, m^2 \end{aligned}$$

NumPy provides the function var() to compute the variance:


print(f"{np.var(X):.2f} meters squared")
0.02 meters squared

This is perhaps a bit hard to interpret. Heights are in meters, yet the variance is meters squared. Thus we have a more commonly used measure, the standard deviation, which is defined as the square root of the variance:

$$\sigma = \sqrt{\mathit{VAR}(X)}=\sqrt{\frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2}$$

It is typical to use $\sigma$ for the standard deviation and $\sigma^2$ for the variance. In most of this book I will be using $\sigma^2$ instead of $\mathit{VAR}(X)$ for the variance; they symbolize the same thing.

For the first class we compute the standard deviation with

$$\begin{aligned} \sigma_x &=\sqrt{\frac{(1.8-1.8)^2 + (2-1.8)^2 + (1.7-1.8)^2 + (1.9-1.8)^2 + (1.6-1.8)^2}{5}} \\ &= \sqrt{\frac{0 + 0.04 + 0.01 + 0.01 + 0.04}{5}} \\ \sigma_x &= 0.1414 \end{aligned}$$

We can verify this computation with the NumPy method numpy.std() which computes the standard deviation. ‘std’ is a common abbreviation for standard deviation.


print(f"std {np.std(X):.4f}")
print(f"var {np.std(X)**2:.4f}")
std 0.1414
var 0.0200

And, of course, $0.1414^2 = 0.02$, which agrees with our earlier computation of the variance.

What does the standard deviation signify? It tells us how much the heights vary amongst themselves. “How much” is not a mathematical term. We will be able to define it much more precisely once we introduce the concept of a Gaussian in the next section. For now I’ll say that for many things 68% of all values lie within one standard deviation of the mean. In other words we can conclude that for a random class 68% of the students will have heights between 1.66 (1.8-0.1414) meters and 1.94 (1.8+0.1414) meters.

We can view this in a plot:


from kf_book.gaussian_internal import plot_height_std
import matplotlib.pyplot as plt

plot_height_std(X)



For only 5 students we obviously will not get exactly 68% within one standard deviation. We do see that 3 out of 5 students are within $\pm1\sigma$, or 60%, which is as close as you can get to 68% with only 5 samples. Let's look at the results for a class with 100 students.

We write one standard deviation as $1\sigma$, which is pronounced "one standard deviation", not "one sigma". Two standard deviations is $2\sigma$, and so on.


from numpy.random import randn
data = 1.8 + randn(100)*.1414
mean, std = data.mean(), data.std()

plot_height_std(data, lw=2)
print(f'mean = {mean:.3f}')
print(f'std  = {std:.3f}')



mean = 1.810
std  = 0.150

By eye roughly 68% of the heights lie within $\pm1\sigma$ of the mean 1.8, but we can verify this with code.

np.sum((data > mean-std) & (data < mean+std)) / len(data) * 100.
64.0

We’ll discuss this in greater depth soon. For now let’s compute the standard deviation for

$$Y = [2.2, 1.5, 2.3, 1.7, 1.3]$$

The mean of $Y$ is $\mu=1.8$ m, so

$$\begin{aligned} \sigma_y &=\sqrt{\frac{(2.2-1.8)^2 + (1.5-1.8)^2 + (2.3-1.8)^2 + (1.7-1.8)^2 + (1.3-1.8)^2}{5}} \\ &= \sqrt{0.152} = 0.39 \ m \end{aligned}$$

We will verify that with NumPy with


print(f'std of Y is {np.std(Y):.2f} m')
std of Y is 0.39 m

This corresponds with what we would expect. There is more variation in the heights for $Y$, and the standard deviation is larger.

Finally, let's compute the standard deviation for $Z$. There is no variation in the values, so we would expect the standard deviation to be zero.

$$\begin{aligned} \sigma_z &=\sqrt{\frac{(1.8-1.8)^2 + (1.8-1.8)^2 + (1.8-1.8)^2 + (1.8-1.8)^2 + (1.8-1.8)^2}{5}} \\ &= \sqrt{\frac{0+0+0+0+0}{5}} \\ \sigma_z &= 0.0 \ m \end{aligned}$$

print(np.std(Z))
0.0

Before we continue I need to point out that I’m ignoring that on average men are taller than women. In general the height variance of a class that contains only men or women will be smaller than a class with both sexes. This is true for other factors as well. Well nourished children are taller than malnourished children. Scandinavians are taller than Italians. When designing experiments statisticians need to take these factors into account.

I suggested we might be performing this analysis to order desks for a school district. For each age group there are likely to be two different means - one clustered around the mean height of the females, and a second mean clustered around the mean heights of the males. The mean of the entire class will be somewhere between the two. If we bought desks for the mean of all students we are likely to end up with desks that fit neither the males nor the females in the school!

We will not consider these issues in this book. Consult any standard probability text if you need to learn techniques to deal with these issues.


Why the Square of the Differences

Why are we taking the square of the differences for the variance? I could go into a lot of math, but let's look at this in a simple way. Here is a chart of the values of $X$ plotted against the mean for $X=[3,-3,3,-3]$.


X = [3, -3, 3, -3]
mean = np.average(X)
for i in range(len(X)):
    plt.plot([i ,i], [mean, X[i]], color='k')
plt.axhline(mean)
plt.xlim(-1, len(X))
plt.tick_params(axis='x', labelbottom=False)



If we didn’t take the square of the differences the signs would cancel everything out:

$$\frac{(3-0) + (-3-0) + (3-0) + (-3-0)}{4} = 0$$

This is clearly incorrect, as there is more than 0 variance in the data.

Maybe we can use the absolute value? We can see by inspection that the result is $12/4=3$ which is certainly correct — each value varies by 3 from the mean. But what if we have $Y=[6, -2, -3, 1]$? In this case we get $12/4=3$ as well. $Y$ is clearly more spread out than $X$, but the computation yields the same value. If we use the formula with squares we get a standard deviation of 3.5 for $Y$, which reflects its larger variation.
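
We can check this argument numerically. NumPy has no built-in mean absolute deviation, so the sketch below computes it directly; note that the absolute-value measure cannot tell the two sets apart, while the standard deviation can:

X = np.array([3, -3, 3, -3])
Y = np.array([6, -2, -3, 1])
for name, d in (('X', X), ('Y', Y)):
    mad = np.mean(np.abs(d - d.mean()))  # mean absolute deviation
    print(name, 'MAD =', mad, ' std =', d.std())
X MAD = 3.0  std = 3.0
Y MAD = 3.0  std = 3.5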

This is not a proof of correctness. Indeed, Carl Friedrich Gauss, the inventor of the technique, recognized that it is somewhat arbitrary. If there are outliers then squaring the difference gives disproportionate weight to that term. For example, let’s see what happens if we have:


X = [1, -1, 1, -2, -1, 2, 1, 2, -1, 1, -1, 2, 1, -2, 100]
print(f'Variance of X with outlier    = {np.var(X):6.2f}')
print(f'Variance of X without outlier = {np.var(X[:-1]):6.2f}')
Variance of X with outlier    = 621.45
Variance of X without outlier =   2.03

Is this "correct"? You tell me. Without the outlier of 100 we get $\sigma^2=2.03$, which accurately reflects how $X$ is varying absent the outlier. The one outlier swamps the variance computation. Do we want to swamp the computation so we know there is an outlier, or robustly incorporate the outlier and still provide an estimate close to the value absent the outlier? Again, you tell me. Obviously it depends on your problem.

I will not continue down this path; if you are interested you might want to look at the work that James Berger has done on this problem, in a field called Bayesian robustness, or the excellent publications on robust statistics by Peter J. Huber [4]. In this book we will always use variance and standard deviation as defined by Gauss.

The point to gather from this is that these summary statistics always tell an incomplete story about our data. In this example variance as defined by Gauss does not tell us we have a single large outlier. However, it is a powerful tool, as we can concisely describe a large data set with a few numbers. If we had 1 billion data points we would not want to inspect plots by eye or look at lists of numbers; summary statistics give us a way to describe the shape of the data in a useful way.


Gaussians

We are now ready to learn about Gaussians. Let’s remind ourselves of the motivation for this chapter.

We desire a unimodal, continuous way to represent probabilities that models how the real world works, and that is computationally efficient to calculate.

Let’s look at a graph of a Gaussian distribution to get a sense of what we are talking about.


from filterpy.stats import plot_gaussian_pdf
plot_gaussian_pdf(mean=1.8, variance=0.1414**2, 
                  xlabel='Student Height', ylabel='pdf');



This curve is a probability density function or pdf for short. It shows the relative likelihood for the random variable to take on a value. We can tell from the chart that a student is somewhat more likely to have a height near 1.8 m than 1.7 m, and far more likely to have a height of 1.9 m than 1.4 m. Put another way, many students will have a height near 1.8 m, and very few students will have a height of 1.4 m or 2.2 m. Finally, notice that the curve is centered over the mean of 1.8 m.

I explain how to plot Gaussians, and much more, in the Notebook Computing_and_Plotting_PDFs in the
Supporting_Notebooks folder. You can read it online here [1].

This may be recognizable to you as a ‘bell curve’. This curve is ubiquitous because under real world conditions many observations are distributed in such a manner. I will not use the term ‘bell curve’ to refer to a Gaussian because many probability distributions have a similar bell curve shape. Non-mathematical sources might not be as precise, so be judicious in what you conclude when you see the term used without definition.

This curve is not unique to heights — a vast amount of natural phenomena exhibits this sort of distribution, including the sensors that we use in filtering problems. As we will see, it also has all the attributes that we are looking for — it represents a unimodal belief or value as a probability, it is continuous, and it is computationally efficient. We will soon discover that it also has other desirable qualities which we may not realize we desire.

To further motivate you, recall the shapes of the probability distributions in the Discrete Bayes chapter:


import kf_book.book_plots as book_plots
belief = [0., 0., 0., 0.1, 0.15, 0.5, 0.2, .15, 0, 0]
book_plots.bar_plot(belief)



They were not perfect Gaussian curves, but they were similar. We will be using Gaussians to replace the discrete probabilities used in that chapter!


Nomenclature

A bit of nomenclature before we continue - this chart depicts the probability density of a random variable having any value between $(-\infty..\infty)$. What does that mean? Imagine we take an infinite number of infinitely precise measurements of the speed of automobiles on a section of highway. We could then plot the results by showing the relative number of cars going past at any given speed. If the average was 120 kph, it might look like this:


plot_gaussian_pdf(mean=120, variance=17**2, xlabel='speed(kph)');



The y-axis depicts the probability density — the relative number of cars traveling at the speed shown on the corresponding x-axis. I will explain this further in the next section.

The Gaussian model is imperfect. Though these charts do not show it, the tails of the distribution extend out to infinity. Tails are the far ends of the curve where the values are the lowest. Of course human heights or automobile speeds cannot be less than zero, let alone $-\infty$ or $\infty$. "The map is not the territory" is a common expression, and it is true for Bayesian filtering and statistics. The Gaussian distribution above models the distribution of the measured automobile speeds, but being a model it is necessarily imperfect. The difference between model and reality will come up again and again in these filters. Gaussians are used in many branches of mathematics, not because they perfectly model reality, but because they are easier to use than any other relatively accurate choice. However, even in this book Gaussians will fail to model reality, forcing us to use computationally expensive alternatives.

You will hear these distributions called Gaussian distributions or normal distributions. Gaussian and normal both mean the same thing in this context, and are used interchangeably. I will use both throughout this book as different sources will use either term, and I want you to be used to seeing both. Finally, as in this paragraph, it is typical to shorten the name and talk about a Gaussian or normal — these are both typical shortcut names for the Gaussian distribution.


Gaussian Distributions

Let's explore how Gaussians work. A Gaussian is a continuous probability distribution that is completely described with two parameters, the mean ($\mu$) and the variance ($\sigma^2$). It is defined as:

$$f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\big[-\frac{(x-\mu)^2}{2\sigma^2}\big]$$

$\exp[x]$ is notation for $e^x$.

Don't be dissuaded by the equation if you haven't seen it before; you will not need to memorize or manipulate it. The computation of this function is stored in `stats.py` with the function `gaussian(x, mean, var, normed=True)`.

Shorn of the constants, you can see it is a simple exponential:

$$f(x)\propto e^{-x^2}$$

which has the familiar bell curve shape


x = np.arange(-3, 3, .01)
plt.plot(x, np.exp(-x**2));



Let’s remind ourselves how to look at the code for a function. In a cell, type the function name followed by two question marks and press CTRL+ENTER. This will open a popup window displaying the source. Uncomment the next cell and try it now.


from filterpy.stats import gaussian
gaussian??
Signature: gaussian(x, mean, var, normed=True)
Source:
def gaussian(x, mean, var, normed=True):
    """
    returns normal distribution (pdf) for x given a Gaussian with the
    specified mean and variance. All must be scalars.

    gaussian (1,2,3) is equivalent to scipy.stats.norm(2,math.sqrt(3)).pdf(1)
    It is quite a bit faster albeit much less flexible than the latter.

    Parameters
    ----------

    x : scalar or array-like
        The value for which we compute the probability

    mean : scalar
        Mean of the Gaussian

    var : scalar
        Variance of the Gaussian

    norm : bool, default True
        Normalize the output if the input is an array of values.

    Returns
    -------

    probability : float
        probability of x for the Gaussian (mean, var). E.g. 0.101 denotes
        10.1%.

    Examples
    --------

    >>> gaussian(8, 1, 2)
    1.3498566943461957e-06

    >>> gaussian([8, 7, 9], 1, 2)
    array([1.34985669e-06, 3.48132630e-05, 3.17455867e-08])
    """

    g = ((2*math.pi*var)**-.5) * np.exp((-0.5*(np.asarray(x)-mean)**2.) / var)
    if normed and len(np.shape(g)) > 0:
        g = g / sum(g)

    return g
File:      s:\programs\anaconda3\lib\site-packages\filterpy\stats\stats.py
Type:      function

Let's plot a Gaussian with a mean of 22 ($\mu=22$), with a variance of 4 ($\sigma^2=4$).

plot_gaussian_pdf(22, 4, mean_line=True, xlabel='$^{\circ}C$');



What does this curve mean? Assume we have a thermometer which reads 22°C. No thermometer is perfectly accurate, and so we expect that each reading will be slightly off the actual value. However, a theorem called the Central Limit Theorem states that if we make many measurements, the measurements will be normally distributed. When we look at this chart we can see it is proportional to the probability of the thermometer reading a particular value given the actual temperature of 22°C.
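
We can watch the Central Limit Theorem at work with a short simulation (a sketch of my own, with arbitrarily chosen numbers): each simulated measurement error is the sum of 50 small, independent, uniformly distributed error sources, yet the histogram of the sums takes on the Gaussian bell shape.

# sum 50 independent uniform error sources for each of 100,000 trials
sums = np.random.uniform(-1, 1, size=(100000, 50)).sum(axis=1)
plt.hist(sums, bins=100, density=True);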

Recall that a Gaussian distribution is continuous. Think of an infinitely long straight line - what is the probability that a point you pick randomly is exactly at 2? Clearly 0%, as there is an infinite number of choices to choose from. The same is true for normal distributions; in the graph above the probability of being exactly 2°C is 0% because there are an infinite number of values the reading can take.

What is this curve? It is something we call the probability density function. The area under the curve at any region gives you the probability of those values. So, for example, if you compute the area under the curve between 20 and 22 the resulting area will be the probability of the temperature reading being between those two temperatures.

Here is another way to understand it. What is the density of a rock, or a sponge? It is a measure of how much mass is compacted into a given space. Rocks are dense, sponges less so. So, if you wanted to know how much a rock weighed but didn’t have a scale, you could take its volume and multiply by its density. This would give you its mass. In practice density varies in most objects, so you would integrate the local density across the rock’s volume.

$$M = \iiint_R p(x,y,z)\, dV$$

We do the same with probability density. If you want to know the temperature being between 20°C and 21°C you would integrate the curve above from 20 to 21. As you know the integral of a curve gives you the area under the curve. Since this is a curve of the probability density, the integral of the density is the probability.

What is the probability of the temperature being exactly 22°C? Intuitively, 0. These are real numbers, and the odds of 22°C vs, say, 22.00000000000017°C is infinitesimal. Mathematically, what would we get if we integrate from 22 to 22? Zero.

Thinking back to the rock, what is the weight of a single point on the rock? An infinitesimal point must have no weight. It makes no sense to ask the weight of a single point, and it makes no sense to ask about the probability of a continuous distribution having a single value. The answer for both is obviously zero.

In practice our sensors do not have infinite precision, so a reading of 22°C implies a range, such as 22 $\pm$ 0.1°C, and we can compute the probability of that range by integrating from 21.9 to 22.1.

We can think of this in Bayesian terms or frequentist terms. As a Bayesian, if the thermometer reads exactly 22°C, then our belief is described by the curve - our belief that the actual (system) temperature is near 22°C is very high, and our belief that the actual temperature is near 18 is very low. As a frequentist we would say that if we took 1 billion temperature measurements of a system at exactly 22°C, then a histogram of the measurements would look like this curve.

How do you compute the probability, or area under the curve? You integrate the equation for the Gaussian

$$\int^{x_1}_{x_0} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(x-\mu)^2/\sigma^2} dx$$

This is called the cumulative probability distribution, commonly abbreviated cdf.

I wrote filterpy.stats.norm_cdf which computes the integral for you. For example, we can compute


from filterpy.stats import norm_cdf
print('Cumulative probability of range 21.5 to 22.5 is {:.2f}%'.format(
      norm_cdf((21.5, 22.5), 22,4)*100))
print('Cumulative probability of range 23.5 to 24.5 is {:.2f}%'.format(
      norm_cdf((23.5, 24.5), 22,4)*100))
Cumulative probability of range 21.5 to 22.5 is 19.74%
Cumulative probability of range 23.5 to 24.5 is 12.10%
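
As a cross-check, scipy gives the same numbers. Note that scipy.stats.norm is parameterized by the standard deviation, not the variance, so we pass 2 rather than 4:

from scipy.stats import norm
print((norm.cdf(22.5, 22, 2) - norm.cdf(21.5, 22, 2)) * 100)  # ~19.74
print((norm.cdf(24.5, 22, 2) - norm.cdf(23.5, 22, 2)) * 100)  # ~12.10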

The mean ($\mu$) is what it sounds like — the average of all possible probabilities. Because of the symmetric shape of the curve it is also the tallest part of the curve. The thermometer reads 22°C, so that is what we used for the mean.

The notation for a normal distribution for a random variable $X$ is $X \sim\ \mathcal{N}(\mu,\sigma^2)$ where $\sim$ means distributed according to. This means I can express the temperature reading of our thermometer as

$$\text{temp} \sim \mathcal{N}(22,4)$$

This is an extremely important result. Gaussians allow me to capture an infinite number of possible values with only two numbers! With the values $\mu=22$ and $\sigma^2=4$ I can compute the distribution of measurements over any range.

Some sources use $\mathcal N(\mu, \sigma)$ instead of $\mathcal N(\mu, \sigma^2)$. Either is fine, they are both conventions. You need to keep in mind which form is being used if you see a term such as $\mathcal{N}(22,4)$. In this book I always use $\mathcal N(\mu, \sigma^2)$, so $\sigma=2$, $\sigma^2=4$ for this example.
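
The same caution applies in code. numpy.random.normal takes the standard deviation, so drawing samples from $\mathcal{N}(22, 4)$ requires passing $\sigma=2$:

from numpy.random import normal

samples = normal(loc=22, scale=2, size=1000000)  # scale is sigma = sqrt(4)
print(samples.mean(), samples.var())             # close to 22 and 4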


The Variance and Belief

Since this is a probability density distribution it is required that the area under the curve always equals one. This should be intuitively clear — the area under the curve represents all possible outcomes, something happened, and the probability of something happening is one, so the density must sum to one. We can prove this ourselves with a bit of code. (If you are mathematically inclined, integrate the Gaussian equation from $-\infty$ to $\infty$.)


print(norm_cdf((-1e8, 1e8), mu=0, var=4))
1.0

This leads to an important insight. If the variance is small the curve will be narrow. This is because the variance is a measure of how much the samples vary from the mean. To keep the area equal to 1, the curve must also be tall. On the other hand if the variance is large the curve will be wide, and thus it will also have to be short to make the area equal to 1.

Let’s look at that graphically. We will use the aforementioned filterpy.stats.gaussian which can take either a single value or array of values.


from filterpy.stats import gaussian

print(gaussian(x=3.0, mean=2.0, var=1))
print(gaussian(x=[3.0, 2.0], mean=2.0, var=1))
0.24197072451914337
[0.378 0.622]

By default gaussian normalizes the output, which turns the output back into a probability distribution. Use the argument `normed` to control this.


print(gaussian(x=[3.0, 2.0], mean=2.0, var=1, normed=False))
[0.242 0.399]

If the Gaussian is not normalized it is called a Gaussian function instead of Gaussian distribution.


xs = np.arange(15, 30, 0.05)
plt.plot(xs, gaussian(xs, 23, 0.2**2), label='$\sigma^2=0.2^2$')
plt.plot(xs, gaussian(xs, 23, .5**2), label='$\sigma^2=0.5^2$', ls=':')
plt.plot(xs, gaussian(xs, 23, 1**2), label='$\sigma^2=1^2$', ls='--')
plt.legend();



What is this telling us? The Gaussian with $\sigma^2=0.2^2$ is very narrow. It is saying that we believe $x=23$, and that we are very sure about that: within $\pm 0.2$ std. In contrast, the Gaussian with $\sigma^2=1^2$ also believes that $x=23$, but we are much less sure about that. Our belief that $x=23$ is lower, and so our belief about the likely possible values for $x$ is spread out — we think it is quite likely that $x=20$ or $x=26$, for example. $\sigma^2=0.2^2$ has almost completely eliminated $22$ or $24$ as possible values, whereas $\sigma^2=1^2$ considers them nearly as likely as $23$.

If we think back to the thermometer, we can consider these three curves as representing the readings from three different thermometers. The curve for $\sigma^2=0.2^2$ represents a very accurate thermometer, and the curve for $\sigma^2=1^2$ represents a fairly inaccurate one. Note the very powerful property the Gaussian distribution affords us — we can entirely represent both the reading and the error of a thermometer with only two numbers — the mean and the variance.

An equivalent formation for a Gaussian is $\mathcal{N}(\mu,1/\tau)$ where $\mu$ is the mean and $\tau$ the precision. $1/\tau = \sigma^2$; it is the reciprocal of the variance. While we do not use this formulation in this book, it underscores that the variance is a measure of how precise our data is. A small variance yields large precision — our measurement is very precise. Conversely, a large variance yields low precision — our belief is spread out across a large area. You should become comfortable with thinking about Gaussians in these equivalent forms. In Bayesian terms Gaussians reflect our belief about a measurement, they express the precision of the measurement, and they express how much variance there is in the measurements. These are all different ways of stating the same fact.

I’m getting ahead of myself, but in the next chapters we will use Gaussians to express our belief in things like the estimated position of the object we are tracking, or the accuracy of the sensors we are using.

这告诉我们什么?$\sigma^2=0.2^2$ 的高斯分布非常窄。它是说我们相信 $x=23$,并且我们对此非常确定:在 $\pm 0.2$ 标准差之内。相比之下,$\sigma^2=1^2$ 的高斯也相信 $x=23$,但我们对此不太确定。我们对 $x=23$ 的置信度较低,因此我们对 $x$ 的可能取值的置信度被分散开了;例如,我们认为 $x=20$ 或 $x=26$ 也很有可能。$\sigma^2=0.2^2$ 几乎完全排除了 $22$ 或 $24$ 作为可能值,而 $\sigma^2=1^2$ 认为它们几乎与 $23$ 一样可能。

如果我们回想一下温度计,我们可以将这三条曲线视为代表三个不同温度计的读数。$\sigma^2=0.2^2$ 的曲线代表一个非常准确的温度计,$\sigma^2=1^2$ 的曲线代表一个相当不准确的温度计。请注意高斯分布为我们提供的非常强大的属性:我们可以只用两个数字(均值和方差)完全表示温度计的读数和误差。

高斯的等效形式是 $\mathcal{N}(\mu, 1/\tau)$,其中 $\mu$ 是均值,$\tau$ 是精度。$1/\tau = \sigma^2$,即方差的倒数。虽然我们在本书中没有使用这种形式,但它强调了方差是衡量数据精确程度的指标。小方差产生高精度:我们的测量非常精确。相反,大方差产生低精度:我们的置信度分散在很大的范围内。您应该习惯以这些等价形式思考高斯分布。在贝叶斯术语中,高斯反映了我们对测量的置信度,它们表达了测量的精度,也表达了测量中有多大的方差。这些都是陈述同一事实的不同方式。

我有点超前了,但在接下来的章节中,我们将使用高斯来表达我们对诸如我们正在跟踪的对象的估计位置或我们正在使用的传感器的准确性等事物的置信度。
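To make the variance/precision equivalence concrete, here is a tiny illustration in plain Python (the variable names are mine, not from the book's library):

variance = 0.2**2      # sigma^2 for the accurate thermometer above
tau = 1 / variance     # precision is the reciprocal of the variance
print(tau)
25.0

A small variance of 0.04 corresponds to a large precision of 25: two ways of saying the measurement is very precise.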

The 68-95-99.7 Rule

It is worth spending a few words on standard deviation now. The standard deviation is a measure of how much the data deviates from the mean. For Gaussian distributions, 68% of all the data falls within one standard deviation ($\pm1\sigma$) of the mean, 95% falls within two standard deviations ($\pm2\sigma$), and 99.7% within three ($\pm3\sigma$). This is often called the 68-95-99.7 rule. If you were told that the average test score in a class was 71 with a standard deviation of 9.4, you could conclude that 95% of the students received a score between 52.2 and 89.8 if the distribution is normal (that is calculated with $71 \pm (2 \times 9.4)$).

Finally, these are not arbitrary numbers. If the Gaussian for our position is $\mu=22$ meters, then the standard deviation also has units meters. Thus $\sigma=0.2$ implies that 68% of the measurements range from 21.8 to 22.2 meters. Variance is the standard deviation squared, thus $\sigma^2 = 0.04$ meters$^2$. As you saw in the last section, writing $\sigma^2 = 0.2^2$ can make this somewhat more meaningful, since the 0.2 is in the same units as the data.
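These percentages are easy to check numerically. A minimal sketch, assuming SciPy is available (it is not otherwise used in this chapter):

from scipy.stats import norm

# fraction of a normal distribution within k standard deviations of the mean
for k in (1, 2, 3):
    print(f'within ±{k}σ: {norm.cdf(k) - norm.cdf(-k):.4f}')
within ±1σ: 0.6827
within ±2σ: 0.9545
within ±3σ: 0.9973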

The following graph depicts the relationship between the standard deviation and the normal distribution.

68-95-99.7 规则

现在值得在标准差上多说几句。标准差是衡量数据偏离均值多少的指标。对于高斯分布,68% 的数据落在均值的一个标准差($\pm1\sigma$)以内,95% 落在两个标准差($\pm2\sigma$)以内,99.7% 落在三个标准差($\pm3\sigma$)以内。这通常称为 68-95-99.7 规则。如果你被告知一个班级的平均考试分数是 71,标准差是 9.4,且分布是正态的,你可以得出结论:95% 的学生的分数在 52.2 到 89.8 之间(由 $71 \pm (2 \times 9.4)$ 计算得出)。

最后,这些不是任意数字。如果我们位置的高斯分布是 $\mu=22$ 米,那么标准差的单位也是米。因此 $\sigma=0.2$ 意味着 68% 的测量值在 21.8 到 22.2 米之间。方差是标准差的平方,因此 $\sigma^2 = 0.04$ 米$^2$。正如您在上一节中看到的那样,写成 $\sigma^2 = 0.2^2$ 可以使这更有意义,因为 0.2 与数据的单位相同。

下图描述了标准偏差与正态分布之间的关系。

from kf_book.gaussian_internal import display_stddev_plot
display_stddev_plot()


[Figure: normal distribution annotated with the 68-95-99.7 standard deviation bands]

Interactive Gaussians

For those that are reading this in a Jupyter Notebook, here is an interactive version of the Gaussian plots. Use the sliders to modify $\mu$ and $\sigma^2$. Adjusting $\mu$ will move the graph to the left and right because you are adjusting the mean, and adjusting $\sigma^2$ will make the bell curve thicker and thinner.

对于那些在 Jupyter Notebook 中阅读本文的人,这里是高斯图的交互式版本。使用滑块修改 $\mu$ 和 $\sigma^2$。调整 $\mu$ 会使图形左右移动,因为你在调整均值,而调整 $\sigma^2$ 会使钟形曲线变粗或变细。

import math
from ipywidgets import interact, FloatSlider

def plt_g(mu,variance):
    plt.figure()
    xs = np.arange(2, 8, 0.01)
    ys = gaussian(xs, mu, variance)
    plt.plot(xs, ys)
    plt.ylim(0, 4.2)  # tall enough for the narrowest variance on the slider (.01)
    plt.show()

interact(plt_g, mu=FloatSlider(value=5, min=3, max=7),
         variance=FloatSlider(value = .03, min=.01, max=1.));

Finally, if you are reading this online, here is an animation of a Gaussian. First, the mean is shifted to the right. Then the mean is centered at $\mu=5$ and the variance is modified.

最后,如果您正在在线阅读这篇文章,这里有一个高斯动画。首先,均值向右移动。然后均值固定在 $\mu=5$,修改方差。

Computational Properties of Normally Distributed Random Variables

The discrete Bayes filter works by multiplying and adding arbitrary probability distributions. The Kalman filter uses Gaussians instead of arbitrary distributions, but the rest of the algorithm remains the same. This means we will need to multiply and add Gaussian random variables (a Gaussian random variable is just another way to say normally distributed random variable).

A remarkable property of Gaussian random variables is that the sum of two independent Gaussian random variables is also normally distributed! The product is not Gaussian, but proportional to a Gaussian. Thus we can say that the result of multiplying two Gaussian distributions is a Gaussian function (recall that function in this context means the values are not guaranteed to sum to one).

Wikipedia has a good article on this property, and I also prove it at the end of this chapter.
https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables

Before we do the math, let’s test this visually.

正态分布随机变量的计算特性

离散贝叶斯滤波器通过对任意的概率分布做乘法和加法来工作。卡尔曼滤波器使用高斯分布代替任意分布,但算法的其余部分保持不变。这意味着我们需要对高斯随机变量做乘法和加法(高斯随机变量只是正态分布随机变量的另一种说法)。

高斯随机变量的一个显著特性是:两个独立的高斯随机变量之和也服从正态分布!乘积不是高斯分布,而是与高斯分布成正比。因此我们可以说,两个高斯分布相乘的结果是一个高斯函数(回想一下,这里的“函数”意味着不能保证各值之和为一)。

维基百科上有一篇关于这个性质的好文章,我也在本章末尾证明了这一点。 https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables

在我们做数学之前,让我们直观地测试一下。

x = np.arange(-1, 3, 0.01)
g1 = gaussian(x, mean=0.8, var=.1)
g2 = gaussian(x, mean=1.3, var=.2)
plt.plot(x, g1, x, g2)

g = g1 * g2  # element-wise multiplication
g = g / sum(g)  # normalize
plt.plot(x, g, ls='-.');


[Figure: the two Gaussians g1 and g2, and their normalized product (dash-dot)]

Here I created two Gaussians, g1 $= \mathcal N(0.8, 0.1)$ and g2 $= \mathcal N(1.3, 0.2)$, and plotted them. Then I multiplied them together and normalized the result. As you can see the result looks like a Gaussian distribution.

Gaussians are nonlinear functions. Typically, if you multiply nonlinear equations you end up with a different type of function. For example, the product of two sines has a shape very different from sin(x).

在这里,我创建了两个高斯分布 g1 $= \mathcal N(0.8, 0.1)$ 和 g2 $= \mathcal N(1.3, 0.2)$ 并绘制了它们。然后我将它们相乘并对结果进行归一化。如您所见,结果看起来像高斯分布。

高斯是非线性函数。通常,如果将非线性方程相乘,最终会得到不同类型的函数。例如,两个 sin 相乘的形状与 sin(x) 非常不同。

x = np.arange(0, 4*np.pi, 0.01)
plt.plot(np.sin(1.2*x))
plt.plot(np.sin(1.2*x) * np.sin(2*x));


[Figure: sin(1.2x) and the product sin(1.2x)·sin(2x)]

But the result of multiplying two Gaussian distributions is a Gaussian function. This is a key reason why Kalman filters are computationally feasible. Said another way, Kalman filters use Gaussians because they are computationally nice.

The product of two independent Gaussians is given by:

$$\begin{aligned}\mu &= \frac{\sigma_1^2\mu_2 + \sigma_2^2\mu_1}{\sigma_1^2 + \sigma_2^2} \\ \sigma^2 &= \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2}\end{aligned}$$

The sum of two Gaussian random variables is given by

$$\begin{gathered}\mu = \mu_1 + \mu_2 \\ \sigma^2 = \sigma_1^2 + \sigma_2^2\end{gathered}$$

At the end of the chapter I derive these equations. However, understanding the derivation is not very important.

但是两个高斯分布相乘的结果是一个高斯函数。这是卡尔曼滤波器在计算上可行的关键原因。换句话说,卡尔曼滤波器使用高斯分布,因为它们在计算上很友好。

两个独立高斯分布的乘积为:

$$\begin{aligned}\mu &= \frac{\sigma_1^2\mu_2 + \sigma_2^2\mu_1}{\sigma_1^2 + \sigma_2^2} \\ \sigma^2 &= \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2}\end{aligned}$$

两个高斯随机变量之和为:

$$\begin{gathered}\mu = \mu_1 + \mu_2 \\ \sigma^2 = \sigma_1^2 + \sigma_2^2\end{gathered}$$

在本章的末尾,我推导出了这些等式。然而,理解推导过程并不是很重要。
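These two rules translate directly into code. Below is a minimal sketch; the names Gaussian, gaussian_multiply, and gaussian_add are mine for illustration, not part of the book's library:

from collections import namedtuple

Gaussian = namedtuple('Gaussian', ['mean', 'var'])

def gaussian_multiply(g1, g2):
    # product of two Gaussians: a Gaussian function with these parameters
    mean = (g1.var * g2.mean + g2.var * g1.mean) / (g1.var + g2.var)
    var = (g1.var * g2.var) / (g1.var + g2.var)
    return Gaussian(mean, var)

def gaussian_add(g1, g2):
    # sum of two independent Gaussian random variables
    return Gaussian(g1.mean + g2.mean, g1.var + g2.var)

print(gaussian_multiply(Gaussian(0.8, 0.1), Gaussian(1.3, 0.2)))

This prints a mean of about 0.967 and a variance of about 0.067, matching the normalized product plotted earlier.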

Putting it all Together

Now we are ready to talk about how Gaussians can be used in filtering. In the next chapter we will implement a filter using Gaussians. Here I will explain why we would want to use Gaussians.

In the previous chapter we represented probability distributions with an array. We performed the update computation by computing the element-wise product of that distribution with another distribution representing the likelihood of the measurement at each point, like so:

总结

现在我们准备讨论如何在滤波中使用高斯分布。在下一章中,我们将使用高斯函数实现一个滤波器。在这里我将解释为什么我们要使用高斯分布。

在上一章中,我们用数组表示概率分布。我们通过计算该分布与代表每个点测量可能性的另一个分布的逐元素乘积来执行更新计算,如下所示:

def normalize(p):
    return p / sum(p)

def update(likelihood, prior):
    return normalize(likelihood * prior)

prior =      normalize(np.array([4, 2, 0, 7, 2, 12, 35, 20, 3, 2]))
likelihood = normalize(np.array([3, 4, 1, 4, 2, 38, 20, 18, 1, 16]))
posterior = update(likelihood, prior)
book_plots.bar_plot(posterior)


[Figure: bar plot of the posterior distribution]

In other words, we have to compute 10 multiplications to get this result. For a real filter with large arrays in multiple dimensions we’d require billions of multiplications, and vast amounts of memory.

But this distribution looks like a Gaussian. What if we use a Gaussian instead of an array? I’ll compute the mean and variance of the posterior and plot it against the bar chart.

换句话说,我们必须计算 10 次乘法才能得到这个结果。对于具有多个维度的大型数组的真实过滤器,我们需要数十亿次乘法和大量内存。

但是这个分布看起来像高斯分布。如果我们使用高斯而不是数组会怎么样?我将计算后验的均值和方差并将其绘制在条形图上。

xs = np.arange(0, 10, .01)

def mean_var(p):
    x = np.arange(len(p))
    mean = np.sum(p * x, dtype=float)
    var = np.sum((x - mean)**2 * p)
    return mean, var

mean, var = mean_var(posterior)
book_plots.bar_plot(posterior)
plt.plot(xs, gaussian(xs, mean, var, normed=False), c='r');
print('mean: %.2f' % mean, 'var: %.2f' % var)
mean: 5.88 var: 1.24

[Figure: posterior bar plot with the fitted Gaussian overlaid in red]

This is impressive. We can describe an entire distribution of numbers with only two numbers. Perhaps this example is not persuasive, given there are only 10 numbers in the distribution. But a real problem could have millions of numbers, yet still only require two numbers to describe it.

Next, recall that our filter implements the update function with

def update(likelihood, prior):
    return normalize(likelihood * prior)

If the arrays contain a million elements, that is one million multiplications. However, if we replace the arrays with a Gaussian then we would perform that calculation with

$$\begin{aligned}\mu &= \frac{\sigma_1^2\mu_2 + \sigma_2^2\mu_1}{\sigma_1^2 + \sigma_2^2} \\ \sigma^2 &= \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2}\end{aligned}$$

which is three multiplications and two divisions.

这令人印象深刻。我们可以仅用两个数字来描述整个数字分布。也许这个例子没有说服力,因为分布中只有 10 个数字。但一个真正的问题可能有数百万个数字,但仍然只需要两个数字来描述它。

接下来,回想一下我们的过滤器使用

def update(likelihood, prior):
    return normalize(likelihood * prior)

如果数组包含一百万个元素,那就是一百万次乘法。

但是,如果我们用高斯分布替换数组,那么我们将使用

$$\begin{aligned}\mu &= \frac{\sigma_1^2\mu_2 + \sigma_2^2\mu_1}{\sigma_1^2 + \sigma_2^2} \\ \sigma^2 &= \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2}\end{aligned}$$

来执行该计算,这只需要三次乘法和两次除法。
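In code the whole update step then collapses to a single constant-time call. A minimal sketch, reusing the hypothetical Gaussian and gaussian_multiply helpers defined above in place of the array-based update:

def update(likelihood, prior):
    # likelihood and prior are Gaussian(mean, var) tuples;
    # this is O(1) no matter how finely we would have discretized the state
    return gaussian_multiply(likelihood, prior)

posterior = update(Gaussian(10.2, 0.8), Gaussian(10.0, 0.5))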
