Benford’s Law in Python

weixin_26746861

Original article: https://medium.com/explorations-in-python/benfords-law-in-python-64c28a560c98

A Python implementation of Benford’s Law, which describes the distribution of the first digits of many sets of numeric data.

I recently posted an article on Zipf’s Law and the application of the Zipfian Distribution to word frequencies in a piece of text. Benford’s Law can be considered a special case of Zipf’s Law.

Applying the Zipfian Distribution (I love the word Zipfian!) to bodies of text is fraught with difficulty due to the vagaries of natural language, and I am sceptical of the practicality and usefulness of doing so beyond demonstrating the Zipfian Distribution and its implementation.

However, the Benford Distribution is very different. If you have a set of numeric data you can try to fit it to the Benford Distribution without worrying about all the quirks and messiness of natural language. You don’t even have to worry about the total number of possible distinct values: the first significant digit of every non-zero number is one of 1 to 9.

Benford’s Law centres on the perhaps surprising fact that in numeric data such as financial transactions, populations, sizes of geographical features etc. the frequencies of first digits follow a roughly reciprocal (more precisely, logarithmic) pattern. This is shown in the following table, where the relative frequencies are calculated using the formula in the heading of the right-hand column.

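The expected frequencies can be reproduced with a few lines of Python. This is a minimal sketch using the standard statement of Benford’s Law, P(d) = log10(1 + 1/d); the function name benford_probability is my own choice for illustration and is not taken from the original project.

```python
import math


def benford_probability(digit):
    """Expected relative frequency of `digit` (1 to 9) as a first digit."""
    return math.log10(1 + 1 / digit)


# Print the expected share of each first digit to three decimal places.
for d in range(1, 10):
    print(f"{d}  {benford_probability(d):.3f}")
```

The nine probabilities sum to 1, and the value for the digit 1 is a little over 0.30, matching the "about 30%" figure quoted in the text.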

If the first digits followed a uniform distribution, which you might intuitively expect, each digit would appear about 11% of the time. However, in the Benford Distribution the digit 1 occurs about 30% of the time, 2 about 18% of the time, and so on. This is clearer when shown in a graph.

The best known use of Benford’s Law is in fraud detection. If someone makes up false data it is unlikely to follow the Benford Distribution you would expect from genuine data, and if the numbers are purely random the first digits would probably fit a uniform distribution, i.e. about 11% each as described above.

* I have mentioned first digits several times and you may be wondering about subsequent digits. Do they also fit the Benford Distribution? The answer is yes, but to a decreasing extent. As you move along the digits the distributions become less Benfordian and more uniform.

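To see how quickly the pattern flattens, the sketch below computes the expected distribution of the second digit. It assumes the usual generalisation of Benford’s Law, in which the probability of a second digit is summed over all nine possible first digits; the function name second_digit_probability is hypothetical and not part of the original project.

```python
import math


def second_digit_probability(digit):
    """Expected relative frequency of `digit` (0 to 9) in the second position."""
    return sum(
        math.log10(1 + 1 / (10 * first + digit))
        for first in range(1, 10)
    )


# Second digits can be 0, so there are ten possible values here.
for d in range(10):
    print(f"{d}  {second_digit_probability(d):.3f}")
```

The values run from roughly 12% for 0 down to roughly 8.5% for 9, which is far closer to a uniform 10% each than the first-digit figures are to 11%.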

For this post I’ll write a function to take a list of numbers and create a table of values showing how closely it fits the Benford Distribution.

I’ll then try it out with two sets of data, a fraudulent-looking one which roughly fits the uniform distribution and a genuine-looking one which roughly fits the Benford Distribution. I will also create a simple console graph to show the results.

The project consists of the following two files, which you can get by cloning or downloading the GitHub repository.

benfordslaw.py
benfordslaw_demo.py

Let’s look at benfordslaw.py.

Firstly we import collections to use collections.Counter. Then comes a constant list of the relative frequencies of the digits 1 to 9 as decimal fractions. I have padded it out with a superfluous 0 at index 0 so we can index the values using the digits they represent. It sits at the top of the module, outside the function, so it can be used by external code.

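The listing itself is not reproduced here, so the following is only a sketch of what such a constant might look like. The name BENFORD_PERCENTAGES is an assumption, and the values are log10(1 + 1/d) rounded to three decimal places.

```python
import collections  # used later for collections.Counter

# Assumed name. Index 0 is padding, so BENFORD_PERCENTAGES[d] is the
# expected share of values whose first digit is d.
BENFORD_PERCENTAGES = [0, 0.301, 0.176, 0.125, 0.097,
                       0.079, 0.067, 0.058, 0.051, 0.046]

print(BENFORD_PERCENTAGES[1])  # expected share for first digit 1
```

Padding the list costs one unused element but avoids an off-by-one subtraction every time a digit is looked up.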

The calculate function takes a list and creates from it a list of the first digits (as strings), which is then used to construct a collections.Counter. This object now contains one key/value pair for each distinct first digit. A useful feature of Counter is that if we try to access a key which does not exist it will return 0.

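The counting step can be illustrated in isolation. This is a small sketch with made-up sample values, assuming the data consists of positive integers so that str(n)[0] really is the first digit.

```python
import collections

data = [123, 45, 1890, 17, 302, 11]  # hypothetical sample values

# Take the first character of each number's string form.
first_digits = [str(n)[0] for n in data]
counts = collections.Counter(first_digits)

print(counts["1"])  # four of the sample values start with 1
print(counts["7"])  # no value starts with 7, so Counter gives 0, not a KeyError
```

Negative numbers or values below 1 such as 0.05 would need extra handling, because their string forms start with "-" or "0".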

Next we need to iterate from 1 to 9, calculating a few values for each digit showing how it compares to the Benford Distribution. These are then used to create dictionaries which are added to a list which is then returned.

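Putting those steps together, a calculate function along the lines described might look like the sketch below. The dictionary keys and the constant name are assumptions made for illustration, and the function expects a non-empty list of positive integers.

```python
import collections

# Assumed constant: index 0 is padding, indexes 1 to 9 hold Benford shares.
BENFORD_PERCENTAGES = [0, 0.301, 0.176, 0.125, 0.097,
                       0.079, 0.067, 0.058, 0.051, 0.046]


def calculate(data):
    """Compare the first digits of `data` with the Benford Distribution."""
    first_digits = [str(n)[0] for n in data]
    counts = collections.Counter(first_digits)
    results = []
    for digit in range(1, 10):
        data_frequency = counts[str(digit)]  # 0 if the digit never occurs
        data_percent = data_frequency / len(data)
        benford_percent = BENFORD_PERCENTAGES[digit]
        results.append({
            "n": digit,
            "data_frequency": data_frequency,
            "data_frequency_percent": data_percent,
            "benford_frequency": len(data) * benford_percent,
            "benford_frequency_percent": benford_percent,
            "difference_frequency_percent": data_percent - benford_percent,
        })
    return results


results = calculate([123, 45, 1890, 17, 302, 11])
print(results[0])
```

Returning a list of plain dictionaries keeps the calculation separate from any printing code that consumes it.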

Now we can try the module out in benfordslaw_demo.py.

main

After importing random and benfordslaw we enter the main function. This calls one of two functions to get some test data, and then throws it at benfordslaw.calculate. The results are then passed to print_as_table and print_as_graph.

get_random_data

This is a simple function which creates 1000 random values between 1 and 1000.

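A function matching that description could be as short as the following sketch. It assumes the values are integers and that both endpoints are included, which is how random.randint behaves.

```python
import random


def get_random_data():
    """Return 1000 random integers between 1 and 1000 inclusive."""
    return [random.randint(1, 1000) for _ in range(1000)]


data = get_random_data()
print(len(data), min(data), max(data))
```

Because the values are spread evenly over 1 to 1000, each first digit from 1 to 9 turns up in roughly one ninth of them.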

get_benford_data

This is more complex because it creates about 1000 values which roughly fit the Benford Distribution. For each first digit from 1 to 9 it loops a number of times proportional to that digit’s share of the Benford Distribution, retrieved from the benfordslaw module. So that we don’t get an exact fit each time, this count is adjusted by a random factor between 0.8 and 1.2. Within this loop we create a set of values with the relevant first digit.

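One way to generate such data is sketched below. To keep the example self-contained the Benford shares are defined inline rather than imported from the benfordslaw module, and the choice of four-digit values is an assumption made purely for illustration.

```python
import random

# Assumed constant: index 0 is padding, indexes 1 to 9 hold Benford shares.
BENFORD_PERCENTAGES = [0, 0.301, 0.176, 0.125, 0.097,
                       0.079, 0.067, 0.058, 0.051, 0.046]


def get_benford_data():
    """Return about 1000 integers whose first digits roughly follow Benford."""
    data = []
    for first_digit in range(1, 10):
        # Scale the expected count by a random factor so the fit is inexact.
        factor = random.uniform(0.8, 1.2)
        count = int(1000 * BENFORD_PERCENTAGES[first_digit] * factor)
        start = first_digit * 1000
        for _ in range(count):
            # start to start + 999 keeps the first digit fixed.
            data.append(random.randint(start, start + 999))
    return data


data = get_benford_data()
print(len(data))
```

The upper bound is start + 999 rather than start + 1000 because random.randint includes both endpoints, and a value such as 2000 would otherwise be counted under the wrong first digit.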

print_as_table

A fiddly but straightforward function to print out the data structure returned by benfordslaw.calculate in a table.

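As a rough idea of what such a function involves, here is a sketch using f-string format specifiers for the column alignment. The column layout and the dictionary keys are assumptions, and the two sample rows are made up purely to show the formatting.

```python
def print_as_table(results):
    """Print one row per digit comparing actual and Benford shares."""
    header = (f"{'Digit':>5} | {'Count':>6} | {'Actual':>7} | "
              f"{'Benford':>7} | {'Diff':>7}")
    print(header)
    print("-" * len(header))
    for row in results:
        print(
            f"{row['n']:>5} | "
            f"{row['data_frequency']:>6} | "
            f"{row['data_frequency_percent']:>7.1%} | "
            f"{row['benford_frequency_percent']:>7.1%} | "
            f"{row['difference_frequency_percent']:>+7.1%}"
        )


# Made-up rows for illustration only; real rows would come from calculate.
sample_results = [
    {"n": 1, "data_frequency": 30, "data_frequency_percent": 0.30,
     "benford_frequency_percent": 0.301,
     "difference_frequency_percent": -0.001},
    {"n": 2, "data_frequency": 20, "data_frequency_percent": 0.20,
     "benford_frequency_percent": 0.176,
     "difference_frequency_percent": 0.024},
]
print_as_table(sample_results)
```

The % format type multiplies by 100 and appends a percent sign, so the fractions can be stored as decimals and only converted when displayed.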

print_as_graph

This also takes the data structure returned by benfordslaw.calculate and prints it as a graph in the console. Firstly we define some ANSI terminal colour constants and after some hard-coded headings we iterate over the data, printing out the Benford and actual frequencies as bars in green and red respectively.

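A console graph of this kind can be sketched with standard ANSI escape codes. The constant names, the bar character, the width parameter and the dictionary keys below are all assumptions, and the sample rows are made up for illustration.

```python
# Standard ANSI escape sequences for coloured terminal text.
GREEN = "\x1b[32m"
RED = "\x1b[31m"
RESET = "\x1b[0m"


def print_as_graph(results, width=50):
    """Print Benford (green) and actual (red) shares as horizontal bars."""
    print("Benford (green) vs actual (red) first-digit shares")
    for row in results:
        # Scale each share (0 to 1) to a bar length in characters.
        benford_bar = "#" * round(row["benford_frequency_percent"] * width)
        actual_bar = "#" * round(row["data_frequency_percent"] * width)
        print(f"{row['n']} {GREEN}{benford_bar}{RESET}")
        print(f"  {RED}{actual_bar}{RESET}")


# Made-up rows for illustration only; real rows would come from calculate.
sample_results = [
    {"n": 1, "data_frequency_percent": 0.12,
     "benford_frequency_percent": 0.301},
    {"n": 2, "data_frequency_percent": 0.10,
     "benford_frequency_percent": 0.176},
]
print_as_graph(sample_results)
```

The colours only show in terminals that interpret ANSI codes; elsewhere the escape sequences appear as literal characters.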

That’s the coding finished so let’s run the program.

python3.8 benfordslaw_demo.py

This is the output with random data.

The green Benford Distribution bars are always the same but the red bars show the actual distribution. In this case we can easily see that the actual distribution is roughly uniform and doesn’t fit the Benford Distribution at all. Very suspicious!

In main, comment out data = get_random_data(), uncomment data = get_benford_data() and run the program again.