Statistical Probability Distributions: Important Distributions in Probability and Statistics

Latest recommended article published on 2024-08-13 18:29:56

weixin_26752765

Article tags: python, machine learning, artificial intelligence, algorithms, big data

Original link: https://medium.com/analytics-vidhya/important-distributions-in-probability-statistics-a868283fa127

Random variables follow different types of distribution in probability space. The distribution determines how a variable behaves and helps us make predictions about it.

Table of contents:

Introduction
Gaussian/Normal Distribution
Binomial Distribution
Bernoulli Distribution
Log Normal Distribution
Power Law/Pareto Distribution
Uses of Distributions

Introduction

Whenever we study an experiment in probability, we talk about a random variable, which is simply a variable that takes the possible outcomes of that experiment as its values. For example, when we roll a die, we expect a value from the set {1,2,3,4,5,6}. So we define a random variable X which takes one of these values every time we roll.

Depending on the experiment, the random variable can take either discrete values or continuous values. The die example is a discrete random variable, because it can only take a countable set of values. But if we are talking about the price of houses in a particular town, the associated random variable can take continuous values (e.g. $550,000, $1,200,523.54, etc.).

When we plot the values of a random variable against the frequency of their appearance in an experiment, we get a frequency distribution plot in the form of a histogram. After smoothing this histogram with Kernel Density Estimation, we get a smooth curve. This curve is referred to as the "Distribution".
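The histogram-then-smooth idea can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the data are simulated, the `gaussian_kde` helper is a hand-written Gaussian kernel estimator rather than a library routine, and the bandwidth of 15,000 is an arbitrary choice made for this example.

```python
import math
import random
from collections import Counter


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x.

    Each sample contributes a small Gaussian bump of width `bandwidth`;
    the estimate is the average of those bumps.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


rng = random.Random(0)  # fixed seed so the run is repeatable

# Discrete case: relative frequency of each face of a simulated die.
rolls = [rng.randint(1, 6) for _ in range(6000)]
counts = Counter(rolls)
rel_freq = {face: counts[face] / len(rolls) for face in range(1, 7)}

# Continuous case: simulated house prices (hypothetical numbers),
# smoothed into a density curve.
prices = [rng.gauss(500_000, 50_000) for _ in range(2000)]
kde = gaussian_kde(prices, bandwidth=15_000)

for face, f in rel_freq.items():
    print(f"face {face}: relative frequency {f:.3f}")
print("density near the centre:", kde(500_000))
print("density in the tail:    ", kde(700_000))
```

A smaller bandwidth follows the histogram more closely and a larger one smooths more aggressively, so the choice is a trade-off rather than a fixed rule.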

Gaussian/Normal Distribution

The Gaussian/Normal distribution is a continuous probability distribution in which the random variable is distributed symmetrically around a mean (μ), with a spread determined by the variance (σ²).

General expression for the Gaussian distribution curve:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Mean (μ): It decides the position of the peak on the X-axis. The data are located symmetrically on either side of the line X = μ. As you can observe in the image, the Blue, Red and Yellow curves are spread on either side of X = 0, but the Green curve has its center at X = -2. So just by looking at these curves, we can say that the mean of Blue, Red and Yellow is 0 whereas that of Green is -2.

Variance (σ²): It decides the spread and height of the curve. Variance is simply the square of the standard deviation. In the image, σ² values for all four curves are given. Even without looking at the values, we can see that the Yellow curve has the lowest height and the largest spread, and spread can be understood intuitively as the standard deviation. So the Yellow curve has the maximum variance of the four, and the Blue curve has the minimum.

If we set μ = 0 and σ = 1, the Normal distribution is called the Standard Normal Distribution (or Standard Normal Variate) and the general expression becomes:

$$f(x) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{x^2}{2}}$$

One may ask what the denominator signifies. The factor σ√(2π), which reduces to √(2π) in the standard case, is a normalizing constant: it is there to ensure that the area under the curve of a Normal distribution is always equal to 1.
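The normalizing role of the denominator can be checked numerically. The sketch below is an illustration only: `normal_pdf` and `area_under_pdf` are names made up for this example, and the integral is approximated with a plain trapezoidal rule over μ ± 10σ, which leaves out a negligible part of the tails.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Normal(mu, sigma^2) distribution at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def area_under_pdf(mu, sigma, width=10.0, steps=20_000):
    """Trapezoidal estimate of the area under the curve on mu +/- width*sigma."""
    lo, hi = mu - width * sigma, mu + width * sigma
    h = (hi - lo) / steps
    total = 0.5 * (normal_pdf(lo, mu, sigma) + normal_pdf(hi, mu, sigma))
    total += sum(normal_pdf(lo + i * h, mu, sigma) for i in range(1, steps))
    return total * h


# The area stays at 1 whatever mean and standard deviation we pick.
for mu, sigma in [(0.0, 1.0), (-2.0, 0.7), (0.0, 2.2)]:
    print(mu, sigma, round(area_under_pdf(mu, sigma), 6))
```

Without the 1/(σ√(2π)) factor the same integral would come out as σ√(2π), so the curve would not be a valid probability density.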

The Normal distribution also tells us a lot about how the data are divided up around the mean. Look at the image:

As you can see, 34.1% of the total mass lies between the mean and one standard deviation to its right, (34.1 + 13.6) = 47.7% lies within two standard deviations to the right, and about 49.9% lies within three. Since the curve is symmetric, the same holds on the left side, so roughly 68.3%, 95.4% and 99.7% of the mass lies within one, two and three standard deviations of the mean.
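These percentages can be reproduced with the standard library's `statistics.NormalDist` (available from Python 3.8). The helper name `mass_right_of_mean` is made up for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def mass_right_of_mean(k):
    """Probability mass between the mean and k standard deviations to its right."""
    return standard_normal.cdf(k) - 0.5


for k in (1, 2, 3):
    one_side = mass_right_of_mean(k)
    both_sides = 2 * one_side  # symmetry: the left side holds the same mass
    print(f"{k} sd: one side {one_side:.4f}, both sides {both_sides:.4f}")
```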

So if we know that some property follows a Normal distribution, e.g. the weights of the people in a town, we can estimate many quantities without performing an extensive analysis. This is the power of the Normal distribution.

Binomial Distribution

As the name suggests, there is a "Bi" in it. This "Bi" stands for the 2 possible outcomes of an experiment: Yes or No, Pass or Fail, 1 or 0, and so on. In the simplest terms, it is the distribution of the number of successes across repeated experiments in which each outcome is either "Success" or "Failure".

As you can observe from the image, it is a discrete probability distribution. Its parameters are n (number of trials) and p (probability of success).

Suppose an event has probability p of SUCCESS, so the probability of FAILURE is (1-p), and you repeat the experiment n times (number of trials = n). Then the probability of getting k successes in n independent Bernoulli trials is:

$$P(X = k) = \binom{n}{k} \, p^{k} (1-p)^{n-k}$$

where k is an integer in the range [0,n] and:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

Note: We will see what a Bernoulli trial is in the next section.

Let me ask a simple question. Suppose a cricket match is going on between India and Australia. Rohit Sharma has already scored 151*, and from experience you know that once he is past 150, Rohit hits a six off any given ball with probability 0.3. It is the last over (6 balls), and your father asks what the chances are that Rohit will hit exactly 4 sixes. How would you find out?

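This is a typical example of Binomial trials, with n = 6 balls, k = 4 sixes and p = 0.3, assuming the balls are independent of each other. So, the solution is:

$$P(X = 4) = \binom{6}{4} (0.3)^{4} (0.7)^{2} = 15 \times 0.0081 \times 0.49 \approx 0.0595$$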

Note: The 6 and 4 in the large parentheses denote 6C4, the number of ways of choosing which 4 of the 6 balls are hit for six.
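The cricket calculation can be checked with a short function. This is a minimal sketch: `binomial_pmf` is a name chosen for the example, and it assumes, as the example above does, that each of the 6 balls is an independent trial with the same probability 0.3.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Exactly 4 sixes in the 6 balls of the last over, with p = 0.3 per ball.
p_four_sixes = binomial_pmf(4, 6, 0.3)
print(f"P(exactly 4 sixes) = {p_four_sixes:.6f}")

# The probabilities over all possible k add up to 1.
total = sum(binomial_pmf(k, 6, 0.3) for k in range(7))
print(f"sum over k = 0..6: {total:.6f}")
```

If the question were "at least 4 sixes" instead, the answer would be the sum of `binomial_pmf(k, 6, 0.3)` for k = 4, 5 and 6.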

Bernoulli Distribution

The Binomial distribution has a special case known as the Bernoulli distribution, where n = 1, which means that just a single trial is conducted. When we put n = 1 in the PMF (Probability Mass Function) of the Binomial, nCk is equal to 1 for both k = 0 and k = 1, and the function becomes:

$$P(X = k) = p^{k} (1-p)^{1-k}$$

where k ∈ {0,1}.
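As a quick check that the Bernoulli PMF is just the Binomial PMF with n = 1, here is a small sketch. The function names and the simulated trials are illustrative assumptions, not part of the original article; the value p = 0.7 is borrowed from the match example in this section.

```python
import random
from math import comb


def bernoulli_pmf(k, p):
    """P(X = k) for a single trial with success probability p; k must be 0 or 1."""
    if k not in (0, 1):
        raise ValueError("a Bernoulli variable only takes the values 0 and 1")
    return p ** k * (1 - p) ** (1 - k)


def binomial_pmf(k, n, p):
    """P(X = k) successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


p = 0.7
for k in (0, 1):
    # With n = 1 the two formulas give the same value.
    print(k, bernoulli_pmf(k, p), binomial_pmf(k, 1, p))

# Simulating many single trials: the success rate should settle near p.
rng = random.Random(1)
trials = [1 if rng.random() < p else 0 for _ in range(10_000)]
success_rate = sum(trials) / len(trials)
print("simulated success rate:", success_rate)
```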

Now let's go back to the India vs Australia match. Say that when Rohit scores a century, the chance of India winning is 0.7. Then you can simply tell your father that there is a 70% chance that India will win. This is nothing but a very basic Bernoulli trial: one experiment with two outcomes, win or lose.

Log Normal Distribution

We have seen the nature of the Normal distribution. At first glance, many would say that the Log Normal curve looks somewhat like a Normal distribution that has been skewed to the right.

Suppose a random variable X follows a Log Normal distribution with parameters μ and σ². Take n observed values of X (x1, x2, x3, ..., xn), apply the natural log to each of them, and collect the results as a new random variable Y = [log(x1), log(x2), log(x3), ..., log(xn)]. This random variable Y will be Normally distributed, with mean μ and variance σ².

In other words, if Y is Normally distributed and we take its exponential X = exp(Y), then X follows a Log Normal distribution. In simple language, as the name suggests, the Log Normal distribution is the distribution of a random variable whose natural log is Normally distributed.
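The relationship between X and Y = log(X) can be seen by simulation. The sketch below uses `random.lognormvariate` from the standard library; the values μ = 1.0 and σ = 0.5, the seed and the sample size are arbitrary choices for illustration.

```python
import math
import random
import statistics

rng = random.Random(42)
mu, sigma = 1.0, 0.5  # parameters of the underlying Normal (assumed values)

# Draw Log Normal samples, then take the natural log of each one.
x = [rng.lognormvariate(mu, sigma) for _ in range(20_000)]
y = [math.log(v) for v in x]

mean_y = statistics.fmean(y)
sd_y = statistics.stdev(y)
mean_x = statistics.fmean(x)

print("mean of log(X):", round(mean_y, 3))  # should be close to mu
print("sd of log(X):  ", round(sd_y, 3))    # should be close to sigma
print("mean of X:     ", round(mean_x, 3))
print("exp(mu + sigma^2 / 2):", round(math.exp(mu + sigma ** 2 / 2), 3))
```

The last two lines show that the mean of X itself is not μ: for a Log Normal variable it equals exp(μ + σ²/2), which is why μ and σ² are best read as the parameters of log(X).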

It has the same two parameters as the Gaussian, μ and σ², but here they are the mean and variance of log(X), not of X itself.

Power Law/Pareto Distribution

A Power Law is a relationship between two quantities in which a relative change in one quantity produces a proportional relative change in the other, so that one quantity varies as a power of the other. It is often summarized by the 80-20 rule: the top 20% of values account for roughly 80% of the total mass. As you can see in the image, the slightly darker left portion holds 80% of the mass and the bright yellow portion on the right holds 20%.

The standard continuous probability distribution that follows a power law is the Pareto distribution.

The Pareto distribution is controlled by two parameters: x_m and α.

x_m can be thought of as playing the role of the mean, since it controls the scale of the curve, and α can be thought of as playing the role of σ, since it controls the shape of the curve. (Note: x_m is not the mean and α is not σ. This is only an intuitive analogy.)

As we can see in the image, all four curves have their peak at x = 1. So x_m = 1 for all the curves.

As we can observe from the image, the peak rises as α increases, and in the extreme case of α tending to infinity the curve collapses into a vertical line at x = x_m. This limit is the Dirac delta function.

As α decreases, the curve becomes flatter.
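The effect of α on the curve, and the link to the 80-20 rule, can be sketched as follows. The density used is the standard Pareto form α·x_m^α / x^(α+1) for x ≥ x_m, and the share held by the top fraction p is p^(1 - 1/α), which is valid only for α > 1. The function names and the sample α values are assumptions made for this example.

```python
import math


def pareto_pdf(x, x_m, alpha):
    """Pareto density: zero below x_m, then decaying as a power of x."""
    if x < x_m:
        return 0.0
    return alpha * x_m ** alpha / x ** (alpha + 1)


def top_share(p, alpha):
    """Share of the total held by the top fraction p of values (needs alpha > 1)."""
    if alpha <= 1:
        raise ValueError("the total is infinite unless alpha > 1")
    return p ** (1 - 1 / alpha)


# The peak sits at x = x_m and its height is alpha / x_m, so it grows with alpha.
for alpha in (1, 2, 3, 5):
    print(f"alpha = {alpha}: height at x_m = {pareto_pdf(1.0, 1.0, alpha)}")

# The alpha for which the top 20% hold exactly 80% of the total.
alpha_80_20 = math.log(5) / math.log(4)
print("alpha for the 80-20 rule:", round(alpha_80_20, 4))
print("top 20% share:", round(top_share(0.2, alpha_80_20), 4))
```

So the 80-20 split corresponds to one particular α (about 1.16); other values of α give a different split.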

Uses of Distributions

If we know that a particular property follows a certain distribution, we can take a sample, estimate the parameters involved, and then plot the probability distribution function to answer many questions.

For example: in a town of 100,000 people we have to do a height analysis, but we cannot survey such a large population. So we select a random sample and find its sample mean and sample standard deviation.

Now suppose a doctor or other expert tells us that height follows a Normal distribution. Then we can easily answer many questions about the whole town from the sample alone.
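The workflow just described can be sketched with `statistics.NormalDist`. Everything numeric here is hypothetical: the sample is simulated rather than surveyed, and the 180 cm threshold is just an example question.

```python
import random
from statistics import NormalDist

rng = random.Random(7)

# Hypothetical stand-in for a survey: 500 simulated heights in cm.
sample = [rng.gauss(170.0, 8.0) for _ in range(500)]

# Estimate the two parameters from the sample.
fitted = NormalDist.from_samples(sample)
print("sample mean:", round(fitted.mean, 2))
print("sample standard deviation:", round(fitted.stdev, 2))

# Question 1: roughly how many of the 100,000 residents are taller than 180 cm?
share_taller_180 = 1 - fitted.cdf(180.0)
estimated_count = round(share_taller_180 * 100_000)
print("estimated residents taller than 180 cm:", estimated_count)

# Question 2: below what height do 95% of residents fall?
p95 = fitted.inv_cdf(0.95)
print("95th percentile height:", round(p95, 1))
```

The answers are only as good as the assumption that height is Normally distributed and that the sample is representative of the town.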

Translated from: https://medium.com/analytics-vidhya/important-distributions-in-probability-statistics-a868283fa127
