How Bayesian Statistics Can Complement Frequentist Statistics


For many years, academics have been using so-called frequentist statistics to evaluate whether experimental manipulations have significant effects.


Frequentist statistics are based on the concept of hypothesis testing: a mathematical estimation of whether your results could have been obtained by chance. The lower the p-value, the more significant the result would be (in frequentist terms). By the same token, you can obtain non-significant results using the same approach. Most of these "negative" results are disregarded in research, although there is tremendous added value in also knowing which manipulations do not have an effect. But that’s for another post ;)

The thing is, in cases where no effect can be found, frequentist statistics are limited in their explanatory power, as I will argue in this post.

Below, I will explore one limitation of frequentist statistics and propose an alternative to frequentist hypothesis testing: Bayesian statistics. I will not go into a direct comparison between the two approaches (there is quite some reading out there, if you are interested). Rather, I will explore why the frequentist approach presents some shortcomings, and how the two approaches can be complementary in some situations (rather than seeing them as mutually exclusive, as sometimes argued).

This is the first of two posts, in which I will focus on the inability of frequentist statistics to disentangle the absence of evidence from the evidence of absence.

Absence of evidence vs evidence of absence

Background

In the frequentist world, statistical tests typically output some statistical measure (t, F, or Z values, depending on your test), and the almighty p-value. I discuss the limitations of relying only on p-values in another post, which you can read to get familiar with some concepts behind its computation. Briefly, a significant p-value (i.e., one below an arbitrarily chosen threshold, called the alpha level, typically set at 0.05) indicates that your manipulation most likely has an effect.
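In code, this decision rule boils down to a simple threshold comparison. A minimal sketch (the effect size of 0.3 and the sample size of 50 are arbitrary illustration values):

```python
import numpy as np
from scipy import stats

np.random.seed(1)
alpha = 0.05  # the (arbitrarily chosen) significance threshold

# Draw a sample whose true mean actually differs from 0
x = np.random.normal(loc=0.3, scale=1, size=50)
t, p = stats.ttest_1samp(a=x, popmean=0)

# In the frequentist framework, H0 is rejected only when p < alpha
if p < alpha:
    print(f"p = {p:.4f} < {alpha}: the manipulation most likely has an effect")
else:
    print(f"p = {p:.4f} >= {alpha}: no conclusion can be drawn from this test alone")
```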

However, what if (and it happens a lot) your p-value is > 0.05? In the frequentist world, such p-values do not allow you to disentangle an absence of evidence from evidence of an absence of effect.

Let that sink in for a bit, because it is the crucial point here. In other words, frequentist statistics are quite effective at quantifying the presence of an effect, but quite poor at quantifying evidence for the absence of an effect. See here for literature.

The demonstration below is taken from some work performed at the Netherlands Institute for Neuroscience, back when I was working in neuroscience research. A very nice paper was recently published on this topic, which I encourage you to read. The code below is inspired by the paper’s repository, written in R.

Simulated Data

Say we generate a random distribution with mean=0.5 and standard deviation=1.


import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)
mean = 0.5; sd = 1; sample_size = 1000
exp_distribution = np.random.normal(loc=mean, scale=sd, size=sample_size)
plt.hist(exp_distribution)

Figure 1 | Histogram depicting a random draw from a normal distribution centered at 0.5

That would be our experimental distribution, and we want to know whether it is significantly different from 0. We could run a one-sample t-test (which would be okay since the distribution looks very Gaussian, although you should theoretically verify that the assumptions of parametric testing are fulfilled; let’s assume they are):

t, p = stats.ttest_1samp(a=exp_distribution, popmean=0)
print('t-value = ' + str(t))
print('p-value = ' + str(p))

Quite a nice p-value, one that would make every PhD student’s spine shiver with happiness ;) Note that with that kind of sample size, almost anything becomes significant, but let’s move on with the demonstration.
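To make that point concrete, here is a quick sketch (with arbitrary illustration values) testing the same tiny shift in mean, 0.05, at two very different sample sizes; the effect is identical, yet only the large sample reliably yields significance:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
true_mean = 0.05  # a tiny, arguably negligible, effect

# Same underlying effect, two very different sample sizes
small = np.random.normal(loc=true_mean, scale=1, size=100)
large = np.random.normal(loc=true_mean, scale=1, size=10_000)

_, p_small = stats.ttest_1samp(a=small, popmean=0)
_, p_large = stats.ttest_1samp(a=large, popmean=0)

print(f"n = 100:    p = {p_small:.4f}")
print(f"n = 10,000: p = {p_large:.6f}")
```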

Now let’s try a distribution centered at 0, which should not be significantly different from 0:

mean = 0; sd = 1; sample_size = 1000
exp_distribution = np.random.normal(loc=mean, scale=sd, size=sample_size)
plt.hist(exp_distribution)
t, p = stats.ttest_1samp(a=exp_distribution, popmean=0)
print('t-value = ' + str(t))
print('p-value = ' + str(p))

Here we have, as expected, a distribution that does not significantly differ from 0. And here is where things get a bit tricky: in such situations, frequentist statistics cannot really tell whether a p-value > 0.05 reflects an absence of evidence or evidence of absence of an effect, although that is a crucial distinction, one that would allow you to completely rule out an experimental manipulation having an effect.

Let’s take a hypothetical situation:

You want to know whether a manipulation has an effect. It might be a novel marketing approach in your communication, an interference with biological activity, or a “picture vs no picture” test in a mail you are sending. You of course have a control group to compare your experimental group to.

When collecting your data, you could see different patterns:


  • (i) the two groups differ.
  • (ii) the two groups behave similarly.
  • (iii) you do not have enough observations to conclude (sample size too small).

While option (i) is evidence against the null hypothesis H0 (i.e., you have evidence that your manipulation had an effect), situations (ii) (evidence for H0, i.e., evidence of absence) and (iii) (no evidence, i.e., absence of evidence) cannot be disentangled using frequentist statistics. But maybe the Bayesian approach can add something to this story...

How p-values are affected by effect and sample sizes

The first step is to illustrate the situations where frequentist statistics have shortcomings.

Approach background

What I will do is plot how frequentist p-values behave when changing both the effect size (i.e., the difference between your control, here with mean = 0, and your experimental distributions) and the sample size (the number of observations or data points).

Let’s first write a function that would compute these p-values:


def run_t_test(m, n, iterations):
    """
    Runs a one-sample t-test for a given effect size (m) and sample size (n),
    repeated `iterations` times, and stores the p-values
    """
    my_p = np.zeros(shape=[1, iterations])
    for i in range(iterations):
        x = np.random.normal(loc=m, scale=1, size=n)
        # Traditional one-sample t-test against a population mean of 0
        t, p = stats.ttest_1samp(a=x, popmean=0)
        my_p[0, i] = p
    return my_p

We can then define the parameters of the space we want to test, with different sample and effect sizes:


# Defines parameters to be tested
sample_sizes = [5, 8, 10, 15, 20, 40, 80, 100, 200]
effect_sizes = [0, 0.5, 1, 2]
nSimulations = 1000

We can finally run the function and visualize:


# Run the function to store all p-values in the array "my_pvalues"
my_pvalues = np.zeros((len(effect_sizes), len(sample_sizes), nSimulations))
for mi in range(len(effect_sizes)):
    for i in range(len(sample_sizes)):
        my_pvalues[mi, i, :] = run_t_test(m=effect_sizes[mi],
                                          n=sample_sizes[i],
                                          iterations=nSimulations)

I will quickly visualize the data to make sure that the p-values seem correct. The output would be:

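The preview below can be produced with something along these lines (a sketch assuming pandas is available; which sample sizes to preview and how many simulations to show are arbitrary choices):

```python
import numpy as np
import pandas as pd
from scipy import stats

# -- recreate the p-value array from the previous steps --
def run_t_test(m, n, iterations):
    my_p = np.zeros(shape=[1, iterations])
    for i in range(iterations):
        x = np.random.normal(loc=m, scale=1, size=n)
        _, p = stats.ttest_1samp(a=x, popmean=0)
        my_p[0, i] = p
    return my_p

sample_sizes = [5, 8, 10, 15, 20, 40, 80, 100, 200]
effect_sizes = [0, 0.5, 1, 2]
nSimulations = 1000

np.random.seed(42)
my_pvalues = np.zeros((len(effect_sizes), len(sample_sizes), nSimulations))
for mi in range(len(effect_sizes)):
    for i in range(len(sample_sizes)):
        my_pvalues[mi, i, :] = run_t_test(m=effect_sizes[mi],
                                          n=sample_sizes[i],
                                          iterations=nSimulations)

# -- preview the first two simulations for a couple of sample sizes --
for n in [5, 15]:
    idx = sample_sizes.index(n)
    preview = pd.DataFrame(my_pvalues[:, idx, :2].T, columns=effect_sizes)
    print(f"p-values for sample size = {n}")
    print("Effect sizes:")
    print(preview)
```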

p-values for sample size = 5
Effect sizes:
          0       0.5       1.0         2
0  0.243322  0.062245  0.343170  0.344045
1  0.155613  0.482785  0.875222  0.152519

p-values for sample size = 15
Effect sizes:
          0       0.5       1.0             2
0  0.004052  0.010241  0.000067  1.003960e-08
1  0.001690  0.000086  0.000064  2.712946e-07

I would make two main observations here:


  1. When the sample size is high enough (lower section), the p-values behave as expected and decrease with increasing effect sizes (since you have more statistical power to detect the effect).
  2. However, we also see that the p-values are not significant for small sample sizes, even when the effect sizes are quite large (upper section). That is quite striking: the effect sizes are the same; only the number of data points differs.

Let’s visualize that.


Visualization

For each sample size (5, 8, 10, 15, 20, 40, 80, 100, 200), we will count the number of p-values falling in significance level bins.

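The counting itself can be sketched with `np.histogram`, using the conventional significance thresholds as bin edges (the uniform placeholder data stands in for one slice of the simulated p-values, since p-values are uniformly distributed under H0):

```python
import numpy as np

# Placeholder p-values: uniformly distributed, as expected under H0
rng = np.random.default_rng(0)
pvals = rng.uniform(0, 1, size=1000)

# Bin edges running from "highly significant" to "not significant"
bins = [0, 0.001, 0.01, 0.05, 1]
counts, _ = np.histogram(pvals, bins=bins)
for lo, hi, c in zip(bins[:-1], bins[1:], counts):
    print(f"{lo} <= p < {hi}: {c} p-values")
```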

Let’s first compare two distributions of equal mean, that is, an effect size of 0.

Figure 2 | Number of p-values located in each “significance” bin for effect size = 0

As we can see from the plot above, most of the p-values computed by the t-test are not significant for an experimental distribution of mean 0. That makes sense, since the two distributions do not differ in their means.

We can, however, see that in some cases we do obtain significant p-values, which can happen when very particular data points are drawn from the overall population. These are typically false positives, and the reason why it is important to repeat experiments and replicate results ;)

Let’s see what happens if we use a distribution whose mean differs by 0.5 compared to the control:


Figure 3 | Number of p-values per “significance” bin for effect size = 0.5

Now we clearly see that increasing the sample size dramatically increases the ability to detect the effect, with still many non-significant p-values at low sample sizes.

Below, as expected, you see that for highly different distributions (effect size = 2), the number of significant p-values increases:

Figure 4 | Number of p-values per “significance” bin for effect size = 2

OK, so that was it for an illustrative example of how p-values are affected by sample and effect sizes.

Now, the problem is that when you have a non-significant p-value, you cannot always be sure whether you missed the effect (say, because you had a low sample size due to limited observations or budget) or whether your data really suggest the absence of an effect. As a matter of fact, most scientific research has a problem of statistical power, because observations are limited (due to experimental constraints, budget, time, publishing pressure, etc.).
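Statistical power can be quantified directly. A sketch using `statsmodels` (assuming it is installed) shows how power for the same medium effect size collapses at the small sample sizes typical of experimental research:

```python
from statsmodels.stats.power import TTestPower

analysis = TTestPower()
effect_size = 0.5  # a medium effect, as in the simulations above

for n in [5, 15, 40, 100]:
    power = analysis.power(effect_size=effect_size, nobs=n, alpha=0.05)
    print(f"n = {n:3d}: power = {power:.2f}")
```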

Since the reality of data in research is rather low sample sizes, you still might want to draw meaningful conclusions from non-significant results based on low sample sizes.

Here, Bayesian statistics could help you make one more step with your data ;)

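As a teaser, one common Bayesian quantity is the Bayes factor BF01, which quantifies evidence for H0 relative to H1 (so a BF01 well above 1 is positive evidence of absence, something a p-value cannot give you). The sketch below uses a rough BIC approximation to the Bayes factor (Wagenmakers, 2007), a shortcut rather than the full Bayesian machinery:

```python
import numpy as np

def bf01_bic(x):
    """
    Approximate Bayes factor in favour of H0 (mean = 0) over H1 (mean free),
    via BF01 ~ exp((BIC_H1 - BIC_H0) / 2), with BIC = n*ln(RSS/n) + k*ln(n).
    """
    n = len(x)
    rss0 = np.sum(x ** 2)               # H0: mean fixed at 0 (1 free parameter: variance)
    rss1 = np.sum((x - x.mean()) ** 2)  # H1: mean estimated (2 free parameters)
    bic0 = n * np.log(rss0 / n) + 1 * np.log(n)
    bic1 = n * np.log(rss1 / n) + 2 * np.log(n)
    return np.exp((bic1 - bic0) / 2)

np.random.seed(42)
null_data = np.random.normal(loc=0, scale=1, size=50)
effect_data = np.random.normal(loc=2, scale=1, size=50)

# BF01 > 1 favours H0 (evidence of absence); BF01 < 1 favours H1
print(f"BF01 (no effect):     {bf01_bic(null_data):.2f}")
print(f"BF01 (strong effect): {bf01_bic(effect_data):.2e}")
```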

Stay tuned for the following post where I explore the Titanic and Boston data sets to demonstrate how Bayesian statistics can be useful in such cases!


You can find this notebook in the following repo: https://github.com/juls-dotcom/bayes


Original article: https://towardsdatascience.com/statistics-how-bayesian-can-complement-frequentist-9ff171bb6396
