The t-distribution in mathematical statistics: a key statistical concept discovered by a beer brewery

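This article introduces the t-distribution from mathematical statistics, a probability distribution that comes up often in data science, statistics, and machine learning. It was proposed by William Gosset, an employee of the Guinness brewery, while he was working on the problem of small samples with unknown variance, and it is closely related to the Gaussian distribution. The t-distribution is typically used when the variance of a normal distribution is unknown or the sample size is small, and it is the basis of the t-test. The Gaussian distribution, in turn, underlies many machine learning algorithms, such as linear models and variational autoencoders.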

by Kirill Dubovikov

The t-distribution: a key statistical concept discovered by a beer brewery

In this post we will look at two probability distributions you will encounter almost every time you do data science, statistics, or machine learning.

Gaussian distribution

Imagine that we are doing research on the heights of people in a city. We go out on the street and measure a bunch of random people. (Some of them thought this was quite strange and wanted to call the police, but come on, this is for science!)

Now we decide that some Exploratory Data Analysis won’t hurt. But statistical software like R isn’t available at the moment, so we just make a histogram out of people.

What do we see here? Ahh, the famous bell curve. This is likely to be the most important probability distribution you will ever encounter. Thanks to the Central Limit Theorem, the Gaussian distribution shows up in many real-world phenomena. It’s so common that people simply call it the normal distribution.

The Central Limit Theorem states that the arithmetic mean of a sufficiently large number of independent, identically distributed random variables with finite variance is approximately normally distributed. The variables themselves can follow almost any distribution. When we measure something that is represented by their sum, we eventually (as the number of samples tends to infinity) end up with a normally distributed quantity.
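
The Central Limit Theorem can be checked with a small simulation. The sketch below is an illustration only: the choice of Uniform(0, 1) as the starting distribution, the sample sizes, and the helper name `sample_means` are all assumptions made for this example, not part of the original article. It averages 50 uniform draws many times and checks how many of the averages land within one standard deviation of the mean, which should be close to the 68.3% expected for a normal distribution.

```python
import random
import statistics


def sample_means(n_terms, n_trials, seed=0):
    """Return n_trials means, each taken over n_terms draws from Uniform(0, 1)."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.random() for _ in range(n_terms))
        for _ in range(n_trials)
    ]


means = sample_means(n_terms=50, n_trials=20_000)

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the mean of 50 draws
# should be approximately Normal(0.5, (1/12) / 50).
mu = 0.5
sigma = (1 / 12 / 50) ** 0.5

# For a normal distribution about 68.3% of values fall within one sigma.
within_one_sigma = sum(abs(m - mu) <= sigma for m in means) / len(means)

print(f"mean of the sample means: {statistics.fmean(means):.4f}")
print(f"share within one sigma:   {within_one_sigma:.3f}")
```

The starting distribution here is flat, not bell-shaped, yet the averages already behave like a normal variable.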

The probability density function of the Gaussian distribution is:

$$ f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) $$

This formula may look a bit intimidating, but it’s convenient to work with mathematically. If you’re interested in how it can be derived, you can read how here. This distribution has two parameters:

  • µ (mean)
  • σ (standard deviation)

The mean µ is the expected value of a normally distributed random variable, the point around which most values cluster. The variance σ² (the square of the standard deviation σ) controls how widely the possible values spread around the mean.
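
To see how the two parameters enter the density, here is a minimal sketch that evaluates the standard Gaussian density, exp(−(x − µ)² / (2σ²)) / (σ√(2π)), by hand and compares it with Python's built-in `statistics.NormalDist`. The function name `gaussian_pdf` and the height figures (mean 170 cm, standard deviation 10 cm) are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist


def gaussian_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Hypothetical height distribution: mean 170 cm, standard deviation 10 cm.
heights = NormalDist(mu=170, sigma=10)

for x in (150, 170, 190):
    # The hand-written formula and the standard library should agree.
    print(x, gaussian_pdf(x, 170, 10), heights.pdf(x))
```

The density peaks at x = µ and is symmetric around it; a larger σ lowers the peak and widens the curve.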

The concept of a normal distribution has immense value in machine learning. A great variety of machine learning algorithms use it extensively:

  • Linear regression models commonly assume that errors are normally distributed
  • Gaussian processes assume that any finite set of function values under the model is jointly normally distributed
  • Gaussian mixtures let you model complex distributions and build classifiers on top of mixture models
  • The normal distribution is one of the main components of Variational Autoencoders

Student’s t-distribution

What if we want to model our data with a Gaussian distribution, but the variance σ² is not known to us? This problem arises when the sample size is small and the standard deviation (σ) cannot be estimated accurately.

William Gosset tackled this problem while working at a Guinness brewery. He empirically found a formula for a t-distributed random variable.

First, suppose we have values x₁, …, xₙ which were sampled from some normal distribution N(µ, σ²).

We do not know the true variance, but we can estimate it from the sample mean and the sample variance:

$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 $$

Then the random variable

$$ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} $$

has a t-distribution with n − 1 degrees of freedom, where n is the number of samples.

This formula resembles the transformation from Normal to Standard Normal (shorthand for the Normal distribution with zero mean and unit variance), applied to the sample mean:

$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$

We don’t know the true population variance, so we have to substitute the sample standard deviation s for the true standard deviation σ.
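
The effect of that substitution can be seen in a short simulation. This is a sketch under stated assumptions: the population parameters (µ = 5, σ = 2), the sample size n = 5, the threshold 1.96, and the helper name `t_statistic` are all chosen for illustration and do not come from the original article. The statistic is computed as (x̄ − µ) / (s / √n), where s uses the n − 1 denominator.

```python
import math
import random
import statistics


def t_statistic(sample, mu):
    """Return (sample mean - mu) / (s / sqrt(n)), with s the sample std dev."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # uses the n - 1 denominator
    return (x_bar - mu) / (s / math.sqrt(n))


rng = random.Random(42)
mu, sigma, n = 5.0, 2.0, 5  # hypothetical population and sample size

t_values = [
    t_statistic([rng.gauss(mu, sigma) for _ in range(n)], mu)
    for _ in range(20_000)
]

# A standard normal variable exceeds 1.96 in absolute value about 5% of
# the time. With only 5 observations (4 degrees of freedom) the t statistic
# crosses that threshold noticeably more often.
tail_share = sum(abs(t) > 1.96 for t in t_values) / len(t_values)
print(f"share of |t| > 1.96: {tail_share:.3f}")
```

Treating the statistic as if it were standard normal would therefore understate how often extreme values occur in small samples.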

This distribution lies at the foundation of a statistical method called the t-test, which Guinness used to measure the quality of its beer.

William Gosset published this result under the pseudonym “Student”. Guinness was afraid that its competitors would discover that the t-test was used to control the quality of its product.

Gosset’s discoveries were later formalized by the famous statistician Ronald Fisher, who is widely regarded as one of the founders of the frequentist approach to statistics.

The t-distribution approaches the standard normal as the degrees of freedom grow. This happens because the sample standard deviation approaches the true standard deviation as the number of samples approaches infinity, so substituting one for the other matters less and less. The “fat” tails of the t-distribution compensate for the extra uncertainty when we are working with small samples.
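
The convergence can be checked numerically. The sketch below uses the standard closed-form density of Student's t-distribution with ν degrees of freedom, Γ((ν + 1)/2) / (√(νπ) Γ(ν/2)) · (1 + x²/ν)^(−(ν + 1)/2), which the original article does not write out; the function name `t_pdf` and the evaluation point x = 3 are assumptions made for this example.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    # Work in log space so that large df does not overflow the gamma function.
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


standard_normal = NormalDist()  # mean 0, standard deviation 1

# Compare the density far out in the tail (x = 3) for growing df.
for df in (1, 5, 30, 1000):
    print(df, t_pdf(3.0, df), standard_normal.pdf(3.0))
```

At x = 3 the t density is always above the normal density, and the gap shrinks as the degrees of freedom increase.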

An interested reader might ask, “So, what is the probability density function of the t-distribution? How can we derive it?” This turns out to be not that easy in terms of mathematics, but the central idea is easy to grasp.

Let’s suppose we are interested in getting the probability density function of a normal variable X ~ N(0, σ²), but without direct dependence on the standard deviation σ.

Intuitively, to get rid of σ we must make some assumptions. Let’s treat σ as a random variable itself, and assume that the precision 1/σ² follows a Gamma distribution (this is a very general distribution which has many uses in Bayesian statistics).

This way we may say that X is a continuous mixture of Normal distributions whose weights are given by a Gamma distribution. Integrating out σ then yields the probability density function of the t-distribution.
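
The mixture idea can be tried out by sampling instead of integrating. This sketch relies on a standard result that the article does not spell out, so treat the exact parametrisation as an assumption: if the precision τ = 1/σ² follows a Gamma distribution with shape ν/2 and rate ν/2, and X given τ is Normal with mean 0 and variance 1/τ, then X follows a t-distribution with ν degrees of freedom. The function name `sample_scale_mixture` and the choice ν = 4 are hypothetical.

```python
import math
import random


def sample_scale_mixture(df, n_samples, seed=0):
    """Draw from a Normal whose precision is Gamma(shape=df/2, rate=df/2)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # random.gammavariate takes (shape, scale), and scale = 1 / rate.
        tau = rng.gammavariate(df / 2, 2 / df)
        # Given tau, X is Normal with mean 0 and standard deviation 1/sqrt(tau).
        samples.append(rng.gauss(0.0, 1.0 / math.sqrt(tau)))
    return samples


df = 4  # assumed degrees of freedom for this illustration
draws = sample_scale_mixture(df, n_samples=50_000)

# A standard normal exceeds 1.96 in absolute value about 5% of the time;
# the mixture has heavier tails, as a t-distribution with 4 degrees of
# freedom should.
tail_share = sum(abs(x) > 1.96 for x in draws) / len(draws)
print(f"share of |x| > 1.96: {tail_share:.3f}")
```

Each draw uses its own randomly chosen spread, and averaging over those spreads is what fattens the tails relative to a single Normal.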

You can see more formal proofs here and here.

Conclusion

The Gaussian distribution and Student’s t-distribution are among the most important continuous probability distributions in statistics and machine learning.

The t-distribution may be used in place of the Gaussian when the population variance is not known, which matters most when the sample size is small. The two are closely related: the t-distribution converges to the standard normal as the degrees of freedom grow.

Thanks for reading my article! I hope it helped you to learn something new or refresh existing knowledge.

Translated from: https://www.freecodecamp.org/news/the-t-distribution-a-key-statistical-concept-discovered-by-a-beer-brewery-dbfdc693184/
