What does and doesn’t follow Benford’s law

Like many others, the first time I heard about Benford’s law, I thought: “What? That’s weird! What’s the trick?” And then, there is no trick. It is just there. It is a law that applies for no apparent good reason.

If you have never heard of it, let’s look at what it is: Imagine a set of numbers from some real-life phenomenon. Say, for example, the populations of all inhabited places on Earth. There would be thousands of numbers, some very big, some very small. Those numbers exist not because of some systematic process but somehow emerged out of the thousands of years of the lives of billions of people. You would therefore expect them to be almost completely random. Thinking about those numbers, if I asked you: “How many of them start with the digit 1, compared to 2, 3, 4, etc.”, intuitively you would probably say: “more or less just as many”. You would be wrong. The answer is: significantly more.

According to Benford’s law (which should really be called the Newcomb–Benford law, see below), in large sets of naturally occurring numbers spanning multiple orders of magnitude, the leading digit of any number is much more likely to be small than big. How much more likely? Formally, the probability P(d) that a number starts with the digit d (from 1 to 9) is given by:

P(d) = log10(1+1/d)

That means that in those sets of naturally occurring numbers, the probability that a number starts with 1 is just over 30%, while the probability that it starts with 9 is just under 5%. Weird, right?
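
As a quick check of the formula, the short Python sketch below tabulates P(d) for d = 1 to 9. The names `benford_probability` and `BENFORD` are only illustrative; they do not come from the scripts used later in this article.

```python
import math

def benford_probability(d: int) -> float:
    """Probability that the leading digit is d (1-9) under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Expected share of each leading digit.
BENFORD = {d: benford_probability(d) for d in range(1, 10)}

if __name__ == "__main__":
    for d, p in BENFORD.items():
        print(f"{d}: {p:.1%}")
```

The nine probabilities add up to 1 because the sum telescopes: log10(2/1) + log10(3/2) + ... + log10(10/9) = log10(10) = 1.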

Figure: Probability of a digit starting a number according to Benford’s law.

This strange natural/mathematical phenomenon was first discovered by Simon Newcomb (hence the complete name), who noticed that the pages at the beginning of books of logarithmic tables, which cover numbers starting with 1, were much more worn than the pages at the end (covering numbers starting with 9). Based on this observation, which suggested that people needed logarithmic tables more often for numbers starting with 1, he first formulated what is now known as the Newcomb–Benford law, although with a slightly different formula for the probability of the first digit. It was rediscovered by Frank Benford 57 years later (Newcomb published his note in 1881, Benford his paper in 1938), and tested on several different things, including populations in the US. Since then, it has been tested on and applied to many things, from financial fraud detection to code, parking spaces or even COVID. The idea is that if a set of numbers occurs or emerges naturally, without being doctored or somehow artificially constrained, there is a good chance it will follow Benford’s law. If it doesn’t, there is something fishy.

But is this really universally true, and to what extent? We can assume that numbers representing different phenomena follow Benford’s law to different extents: sometimes more, sometimes less, and sometimes not at all. So the question is:

What follows Benford’s law, and perhaps more importantly, what does not?

To answer that, we need a lot of those sets of naturally-occurring numbers.

Wikidata to the rescue

Newcomb and Benford were not quite as lucky as we are. To find sets of numbers on which to test their law, they had to manually collect them from whatever source was available. Nowadays, not only do we have a universally accessible encyclopedia of everything, we have a data version of it: Wikidata.

Wikidata is the Wikipedia of data. It is a crowdsourced database of, if not everything, at least quite a big part of it. With a relatively simple query, Wikidata can for example quickly return the population of every US city, and many, many more things. It should therefore also be possible to obtain all the sets of numbers it contains.

To do that, we use the RDF-based representation of Wikidata. RDF (Resource Description Framework) is a graph-based data representation for the web. Basically, things in RDF are represented by URIs, and connected by labelled edges to other things or values. For example, the figure below shows a simplified extract of what the representation of the city of Galway, located in Ireland and with a population of 79,504 people, looks like in Wikidata’s RDF.

Figure: Graph representation of some information about Galway from Wikidata.

The nice thing about RDF is that even a very, very large graph can be represented by a set of triples of the form <subject,predicate,object>. Each of those triples corresponds to an edge in the graph, and represents one atomic piece of information.

<http://www.wikidata.org/entity/Q129610> <http://www.w3.org/2000/01/rdf-schema#label> "Galway" .
<http://www.wikidata.org/entity/Q129610> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.wikidata.org/entity/Q515> .
<http://www.wikidata.org/entity/Q515> <http://www.w3.org/2000/01/rdf-schema#label> "City" .
<http://www.wikidata.org/entity/Q129610> <http://www.wikidata.org/entity/P17> <http://www.wikidata.org/entity/Q27> .
<http://www.wikidata.org/entity/Q27> <http://www.w3.org/2000/01/rdf-schema#label> "Ireland" .
<http://www.wikidata.org/entity/Q129610> <http://www.wikidata.org/entity/P1082> "79504"^^<http://www.w3.org/2001/XMLSchema#integer> .
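
To make the triple structure concrete, here is a minimal Python sketch that splits one N-Triples line into its subject, predicate, object and (optional) datatype. The function `parse_triple` and its regular expression are assumptions written for this illustration, not part of the article’s scripts, and they deliberately ignore blank nodes and other corners of the format.

```python
import re

# One N-Triples statement: subject and predicate are <URI>s; the object is
# either a <URI> or a quoted literal with an optional ^^<datatype> or @lang.
TRIPLE = re.compile(
    r'^<(?P<s>[^>]*)>\s+<(?P<p>[^>]*)>\s+'
    r'(?:<(?P<o_uri>[^>]*)>'
    r'|"(?P<o_lit>(?:[^"\\]|\\.)*)"'
    r'(?:\^\^<(?P<dtype>[^>]*)>|@[A-Za-z0-9-]+)?)'
    r'\s*\.\s*$'
)

def parse_triple(line: str):
    """Return (subject, predicate, object, datatype) or None if the line does not match."""
    m = TRIPLE.match(line)
    if m is None:
        return None
    obj = m.group("o_uri") if m.group("o_uri") is not None else m.group("o_lit")
    return m.group("s"), m.group("p"), obj, m.group("dtype")

if __name__ == "__main__":
    line = ('<http://www.wikidata.org/entity/Q129610> '
            '<http://www.wikidata.org/entity/P1082> '
            '"79504"^^<http://www.w3.org/2001/XMLSchema#integer> .')
    print(parse_triple(line))
```

For the population triple, the datatype URI ending in `XMLSchema#integer` is what marks the object as a number.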

So the first step in collecting sets of numbers from Wikidata is to extract all the triples for which the object part is a number.

We start by downloading the full dump of the entire Wikidata database as a compressed NTriples (NT) file. NT is a terribly inefficient representation of RDF in which each triple takes up one line. The GZipped file to download (latest-all.nt.gz) is quite large (143GB) and I would not recommend trying to uncompress it. However, because each triple sits on its own line, completely independent from the rest, this format makes it very easy to filter the data with basic Linux command-line tools without having to load the whole thing in memory. So, to extract the triples whose objects are numbers, we use zgrep (grep that works on GZipped files) to find triples that reference the decimal, integer or double types in the following way:

zgrep "XMLSchema#decimal" latest-all.nt.gz > numbers.decimal.nt
zgrep "XMLSchema#double" latest-all.nt.gz > numbers.double.nt
zgrep "XMLSchema#integer" latest-all.nt.gz > numbers.integer.nt

Then we can put all of those together using the cat command:

cat numbers.decimal.nt numbers.double.nt numbers.integer.nt > numbers.nt

And check how many triples this 110GB file ends up containing by counting the number of lines in it:

wc -l numbers.nt 
730238932 numbers.nt

Each line is a triple. Each triple has a number as its value (object). That’s a lot of numbers.
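
The same filtering can also be sketched in Python, streaming the GZipped dump line by line so that nothing has to be uncompressed on disk or held in memory. This is an alternative written for illustration, not what was run for this article, and the function name `filter_numeric_triples` is made up. It is also slightly stricter than the zgrep commands: it only keeps a line when the datatype appears at the very end, as the type of the object.

```python
import gzip

# Datatype suffixes that mark the object of a triple as a number.
NUMERIC_TYPES = (
    "XMLSchema#decimal>",
    "XMLSchema#double>",
    "XMLSchema#integer>",
)

def filter_numeric_triples(src_path: str, dst_path: str) -> int:
    """Copy lines whose object has a numeric XSD datatype; return how many were kept."""
    kept = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Strip the trailing " ." so the line ends with the datatype URI.
            body = line.rstrip().rstrip(".").rstrip()
            if body.endswith(NUMERIC_TYPES):
                dst.write(line)
                kept += 1
    return kept
```

Calling `filter_numeric_triples("latest-all.nt.gz", "numbers.nt")` would produce a single file directly, without the intermediate per-type files.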

The next step is to organise those numbers into meaningful sets and count how many in each set start with 1, with 2, with 3, etc. Here, we use the “predicate” part of the triple. There are, for example, 621,574 triples in this file that have <http://www.wikidata.org/entity/P1082> as predicate. http://www.wikidata.org/entity/P1082 is the property Wikidata uses to represent the population of inhabited places. So we can group all of those, and make it into the set of all populations known to Wikidata. That will be one of the naturally occurring sets of numbers that we will test.

The simple Python script below creates a JSON file with the list of properties and, for each one, the number of triples of which it is the predicate, the minimum and maximum numbers those triples have as values (from which the number of orders of magnitude they cover follows), and the number of numbers starting with 1, 2, 3, etc.

import json
import re
import time

data = {}
st = time.time()
count = 0
with open("numbers.nt") as f:
    for line in f:
        parts = line.split()
        p = parts[1]  # predicate URI
        # First run of digits in the object literal: the sign is ignored and
        # a value such as 0.5 is recorded as 0.
        val = int(re.findall(r"\d+", parts[2])[0])
        val1 = str(val)[0]  # leading digit, as a string
        if p not in data:
            # i = minimum, a = maximum, c = count, ns = count per leading digit
            data[p] = {"i": val, "a": val, "c": 0, "ns": {}}
        if val < data[p]["i"]:
            data[p]["i"] = val
        if val > data[p]["a"]:
            data[p]["a"] = val
        if val1 not in data[p]["ns"]:
            data[p]["ns"][val1] = 0
        data[p]["ns"][val1] += 1
        data[p]["c"] += 1
        count += 1
        if count % 1000000 == 0:
            # progress: millions of lines read, properties seen, seconds for this batch
            print(str(count/1000000)+" "+str(len(data.keys()))+" "+str(time.time()-st))
            st = time.time()
        if count % 10000000 == 0:
            # checkpoint every 10 million lines
            with open("numbers.json", "w") as f2:
                json.dump(data, f2)
with open("numbers.json", "w") as f2:
    json.dump(data, f2)
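
The script takes the first run of digits in each literal, so a value such as 0.05 is counted under a leading 0 rather than under 5. If the first significant digit is wanted instead, one possible variation is sketched below. Both helpers, `leading_digit` and `orders_of_magnitude`, are hypothetical names introduced here, and defining the orders of magnitude as the difference between the base-10 logarithms of the maximum and minimum is an assumption, not necessarily the definition used for the results in this article.

```python
import math
from decimal import Decimal, InvalidOperation

def leading_digit(literal: str):
    """First significant digit (1-9) of a numeric literal, or None for zero or non-numbers."""
    try:
        value = Decimal(literal)
    except InvalidOperation:
        return None
    if not value.is_finite() or value == 0:
        return None
    # The coefficient of a non-zero Decimal never starts with 0,
    # so its first digit is the first significant digit.
    return value.as_tuple().digits[0]

def orders_of_magnitude(smallest, largest):
    """Powers of ten between two positive values, or None if either is not positive."""
    if smallest <= 0 or largest <= 0:
        return None
    return math.log10(largest) - math.log10(smallest)

if __name__ == "__main__":
    for text in ["79504", "0.05", "-1200", "1.5E+3", "0"]:
        print(text, leading_digit(text))
```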

We obtain 1,582 properties in total, representing as many sets of numbers to be tested against Benford’s law. We reduce this to 505 properties, as Wikidata often represents the same relation through several redundant properties. We also extract in another script the label (name) and description of each of the properties, so we don’t have to look them up later.

Testing for Benfordness

Now that we have many sets of numbers, and their distributions according to their leading digits, we can check how closely they follow Benford’s law. Several statistical tests can be used to do this. Here, we use a relatively simple one called Chi-Squared. The value of this test for a set of numbers is given by the formula

χ² = Σᵢ(oᵢ-eᵢ)² / eᵢ

where i is the leading digit under consideration (1 to 9), oᵢ is the observed value for i (the proportion of numbers in the set that start with i) and eᵢ is the expected value (the proportion of numbers that should start with i according to Benford’s law). The smaller the result, the more Benford the set of numbers is. The script below calculates the Chi-Squared test on each set of numbers created with the previous script, to check how well they fit Benford’s law.

import json
import math
import sys

if len(sys.argv) != 2:
    print("provide filename")
    sys.exit(-1)

# Expected proportion of each leading digit under Benford's law
es = {str(d): math.log10(1. + 1. / d) for d in range(1, 10)}
print("expected values: "+str(es))

with open(sys.argv[1]) as f:
    data = json.load(f)

for p in data:
    # Count only numbers whose leading digit is 1-9 (a leading 0 is ignored)
    total = 0
    for n in es:
        if n in data[p]["ns"]:
            total += data[p]["ns"][n]
    cs = 0.
    for n in es:
        e = es[n]
        a = 0.
        if n in data[p]["ns"] and total > 0:
            a = float(data[p]["ns"][n])/float(total)  # observed proportion
        cs += (((a-e)**2)/e)  # chi-square test
    data[p]["f"] = cs

with open(sys.argv[1]+".fit.json", "w") as f:
    json.dump(data, f)
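
To get a feel for the values this measure produces, here is a small self-contained sketch that applies the same formula to two toy sets of integers: the first 1,000 powers of 2, which are well known to follow Benford’s law closely, and the integers from 1 to 999, whose leading digits are evenly spread. The function `benford_fit` is a name chosen for this example, and it only handles integers.

```python
import math
from collections import Counter

# Expected proportion of each leading digit under Benford's law.
EXPECTED = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def benford_fit(values):
    """Sum over digits 1-9 of (observed share - expected share)^2 / expected share."""
    counts = Counter(int(str(abs(v))[0]) for v in values if v != 0)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one non-zero value")
    return sum((counts[d] / total - e) ** 2 / e for d, e in EXPECTED.items())

powers_of_two = [2 ** n for n in range(1000)]
one_to_999 = list(range(1, 1000))

if __name__ == "__main__":
    print("powers of 2:", benford_fit(powers_of_two))
    print("1 to 999:   ", benford_fit(one_to_999))
```

The first value should come out very close to zero and the second one far larger, which gives a rough scale against which to read the results below.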

So, is it real?

The results obtained are available in a Google Spreadsheet for convenience. The scripts and results are also available on GitHub.

The first thing to notice when looking at the results is that our favourite example, population, does very well. It is in fact the second-best fit for Benford’s law, with a Chi-Squared value of 0.000445. There are over 600K numbers in this set, which just happen to exist, and they follow almost exactly what Benford’s law predicts. With such a low Chi-Squared value and such a large sample, the chances that this could be a coincidence are too small to contemplate. It is real.

Unsurprisingly, several other properties closely related to population also end up among the 10 best fits for Benford’s law, including literate/illiterate populations, male/female populations or number of households.

The question I’m sure everybody is dying to see answered is “which property is first then?”, since population is only second. With a Chi-Squared of 0.000344, the first place actually goes to a property called number of visitors which is described as the “number of people visiting a location or an event each year” (48,908 numbers in total).

Amongst the very highly Benford properties, we also find area (the area occupied by an object) and total valid votes (for elections; the number of blank votes also does well on Benfordness).

There also seem to be quite a few disease-related properties among the highly Benford properties in Wikidata, including number of cases, number of recoveries, number of clinical tests, and number of deaths.

Numbers related to companies also appear very strongly amongst the top Benford properties. The property employees (the number of employees of a company) is the strongest among those, but we also see patronage, net income, operating income, and total revenue.

Sports statistics make a good showing, with total shots in career, career plus-minus rating and total points in career, together with several biology- and other nature-related properties, such as wingspan (of aeroplanes or animals), proper motion (of stars), topographic prominence (roughly, how far a mountain or hill rises above its surroundings) or distance from Earth (of astronomical objects).

There are of course many more properties that fit Benford’s law very well: the ones above only cover the very best-fitting sets of numbers (with a Chi-Squared below 0.01). They are also not particularly surprising, as they match very well the characteristics of sets of numbers that should normally follow Benford’s law: they are large (the smallest, number of blank votes, still contains 886 numbers), cover several orders of magnitude (from 3 to 80) and, more importantly, are naturally occurring. They are not generated through any systematic process. They just emerged.

There is one very significant exception to this, however. With a Chi-Squared value of 0.00407, and 3,259 numbers covering 7 orders of magnitude, we find the property Alexa rank. This corresponds to the ranking of websites by the Alexa internet service, which provides information about websites based on traffic and audiences. It is very hard to explain how it could fit so well since, being a ranking, it should normally be uniformly distributed from 1 to its maximum value. There are, however, two possible explanations for why this happened: (1) for a given website, several rankings might be available, covering several years, and (2) not all websites ranked by Alexa are in Wikidata. In other words, it is not the ranking itself that follows Benford’s law, but the naturally occurring selection of rankings in Wikidata. Worryingly, the same kind of thing might affect other results too, and it is a good example of how any dataset might be biased in a way that seriously affects the results of statistical analyses.

How about the bad ones?

So, we have verified that Benford’s law indeed applies to many naturally occurring sets of numbers, and even sometimes to naturally occurring selections of non-naturally occurring numbers. What about the cases when it does not work?

First, we can eliminate all the sets that are too small (fewer than 100 numbers) or that cover too few orders of magnitude (fewer than 3). Those, in most cases, would not fit at all, and if they do, we cannot rule out the possibility that it is just a coincidence.
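
This filtering step can be sketched on top of the JSON structure produced by the scripts above, where each property carries a count `c`, a minimum `i`, a maximum `a` and a fit value `f`. The helper names `eligible` and `rank_by_fit` are made up for this example, and so are two choices: a set is dropped when it fails either threshold, and a set whose minimum is zero or negative is dropped because its number of orders of magnitude is undefined here.

```python
import math

def eligible(stats, min_count=100, min_orders=3):
    """stats is one entry of the JSON: c = count, i = min, a = max, f = fit."""
    if stats["c"] < min_count:
        return False
    if stats["i"] <= 0 or stats["a"] <= 0:
        # Orders of magnitude are undefined here; dropping the set is an
        # assumption of this sketch.
        return False
    return math.log10(stats["a"]) - math.log10(stats["i"]) >= min_orders

def rank_by_fit(data, **kwargs):
    """Eligible properties sorted from most to least Benford (smallest f first)."""
    kept = {p: s for p, s in data.items() if eligible(s, **kwargs)}
    return sorted(kept, key=lambda p: kept[p]["f"])

if __name__ == "__main__":
    # Toy entries with invented values, only to show the shape of the data.
    toy = {
        "big": {"c": 5000, "i": 1, "a": 10**6, "f": 0.001},
        "small": {"c": 50, "i": 1, "a": 10**6, "f": 0.0001},
        "narrow": {"c": 5000, "i": 100, "a": 900, "f": 3.0},
    }
    print(rank_by_fit(toy))
```

Reversing the sorted list gives the worst-fitting eligible properties first, which is the view used in the rest of this section.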

The worst-fitting property of the whole set we looked at is lowest atmospheric pressure, described as the “minimum pressure measured or estimated for a storm (a measure of strength for tropical cyclones)”, with a Chi-Squared of 12.6. The set contains 1,783 numbers ranging from 4 to 1,016. While this is a naturally occurring set of numbers that matches the needed characteristics, it is easy to see why it does not fit. Atmospheric pressure does not usually vary that much, and it can be expected that most of the values are actually close to the average sea-level pressure (1013 mbar). It is even possible that the value of 4 is simply an error in the data.
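
The effect of such a narrow range can be illustrated with made-up numbers. The sketch below draws 2,000 synthetic pressures clustered a little below 1013 mbar and measures their fit. These values are generated for the example only and are not the Wikidata values; the mean of 960 and spread of 40 are arbitrary assumptions.

```python
import math
import random
from collections import Counter

# Expected proportion of each leading digit under Benford's law.
EXPECTED = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def benford_fit(values):
    """Same proportion-based Chi-Squared measure as in the script above."""
    counts = Counter(int(str(v)[0]) for v in values)
    total = sum(counts.values())
    return sum((counts[d] / total - e) ** 2 / e for d, e in EXPECTED.items())

rng = random.Random(0)  # fixed seed so the example is repeatable
# Synthetic, made-up pressures in mbar; NOT taken from Wikidata.
pressures = [max(1, round(rng.gauss(960, 40))) for _ in range(2000)]
leading = Counter(str(p)[0] for p in pressures)

if __name__ == "__main__":
    print("leading digits:", dict(leading))
    print("fit:", benford_fit(pressures))
```

Almost all of these values start with 9, with a few starting with 8 or 1, so the fit value is very large even though nothing about the numbers is artificial.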

Many of the other non-fitting properties can be explained similarly: their values usually don’t vary enough for them to follow Benford’s law, but exceptions and outliers mean that they still span several orders of magnitude. Those include wheelbase (distance between front wheels and rear wheels on a vehicle), life expectancy (of species), field of view (of a device for example), or mains voltage (in a country or region).

Interestingly (maybe because I know nothing about astronomy), while they relate to natural phenomena, many of the badly fitting properties correspond to measures of planets or space: semi-major axis of an orbit, effective temperature (of a star or planet), apoapsis (distance at which a celestial body is farthest from the object it orbits), periapsis (distance at which a celestial body is closest to the object it orbits), metallicity (abundance of elements heavier than hydrogen or helium in an astronomical object) and orbital period (the time taken for a given astronomical object to make one complete orbit around another object). Maybe Benford’s law is a law of Earth’s nature, or even of human nature, rather than a law of the universe.

If that’s the case, however, there is an interesting exception here: among the really badly fitting properties, we find number of viewers/listeners (number of viewers of a television or broadcasting program; web traffic on websites). This set includes 248 numbers, ranging from 5,318 to 6 billion. This feels very paradoxical considering that Alexa rank (which is related to the size of the audience for websites) was also our exception among the well-fitting properties. Maybe the same explanation applies, however: if we had a complete set of numbers of viewers/listeners for everything, it might follow Benford’s law very well. But we don’t, and the selection of those 248, which might focus on the most noticeable websites and programs, introduces a bias: a sufficiently unnatural constraint to make the set lose its Benfordness.

So?

Benford’s law is weird. It is unexpected and not well explained, but somehow it actually works. There is clearly a category of sets of numbers that are supposed to follow Benford’s law, and in effect very much do so. There are also others that apparently don’t, and in most cases it is relatively easy to see why. Interestingly, however, there are a few cases where numbers don’t behave the way they should with respect to Benford’s law. As mentioned at the beginning of this article, this phenomenon has been used extensively to detect when numbers have been doctored, by checking for numbers that should follow Benford’s law and in effect do not. It seems that there is another category of applications, looking at selections of numbers that should not follow Benford’s law. If they do, it might be an indication that what looks like random sampling has in fact been biased by the natural tendency of uncontrolled human and natural processes to produce highly Benford numbers. Knowing this, and knowing which category a dataset is supposed to fall into, could be very useful for testing data for such biases and for checking the representativeness of data samples.

And of course, there is the case of astronomy. I suspect the answer would simply emphasize my own ignorance in the matter, but I would really like to know why all those measures related to astronomical bodies so stubbornly refuse to obey Benford’s law.

Translated from: https://towardsdatascience.com/what-does-doesnt-follow-benford-s-law-7d0b3c14afa5
