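Thanks to the general election in a country across the ocean, Benford's law has suddenly become a hot topic on Weibo and Zhihu over the last couple of days, because some people used it to argue that the vote counts show signs of manipulation. So what is Benford's law? Roughly, it says this: for a sufficiently large set of naturally occurring decimal data that is hard to manipulate, such as the population, area, or GDP of every country in the world, the fraction of values whose leading digit is n (for n from 1 to 9) is: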
$$P(n)=\log_{10}\left(\frac{n+1}{n}\right)$$
Working this out, values that start with 1 make up 30.10% of the data, values that start with 2 make up 17.61%, values that start with 3 make up 12.49%, and so on:
n | P(n) |
---|---|
1 | 30.10% |
2 | 17.61% |
3 | 12.49% |
4 | 9.69% |
5 | 7.92% |
6 | 6.69% |
7 | 5.80% |
8 | 5.12% |
9 | 4.58% |
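The table above can be reproduced with a few lines of awk. This is only a minimal sketch: `benford_expected` is a helper name I made up for illustration, and it relies on the fact that awk only provides the natural logarithm, so the base-10 logarithm is computed as `log(x) / log(10)`.

```bash
#!/bin/bash
# benford_expected: hypothetical helper name, not part of the original post.
# Prints the expected share of each leading digit n = 1..9 under Benford's law.
benford_expected() {
    awk 'BEGIN {
        for (n = 1; n <= 9; n++) {
            # awk only has the natural log, so log10(x) = log(x) / log(10)
            printf("%d %.2f%%\n", n, 100 * log((n + 1) / n) / log(10))
        }
    }'
}

benford_expected
```

The nine shares add up to exactly 100%, because the product (2/1)(3/2)...(10/9) telescopes to 10 and log10(10) = 1.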
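Doesn't that sound surprising? Intuition suggests that each leading digit from 1 to 9 should be equally likely, at 1/9 (about 11.1%) each, but that is not what happens. Li Yongle (李永乐) has a video that explains the law in some detail, with a few worked examples and an informal proof; if you are interested, search for it on Weibo.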
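A quick way to see the 1/9 intuition fail is to tally the leading digits of a sequence that is known to follow the law closely, such as the powers of two. The sketch below is my own illustration, not something from the video: `leading_digits_pow2` is a made-up name, and the range 2^1 to 2^60 is chosen only because those values fit in bash's 64-bit integer arithmetic.

```bash
#!/bin/bash
# leading_digits_pow2: hypothetical helper, not part of the original post.
# Prints "DIGIT COUNT" for the leading digits of 2^1 .. 2^60.
leading_digits_pow2() {
    local i
    for (( i = 1; i <= 60; i++ )); do
        echo $(( 1 << i ))      # exact: bash uses 64-bit integer arithmetic
    done | cut -c1 | sort | uniq -c | awk '{ print $2, $1 }'
}

leading_digits_pow2
```

In this range 18 of the 60 values start with 1, which is 30%, very close to the 30.10% the formula predicts and nowhere near 1/9.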
Do the view counts of my own CSDN blog posts follow this law? I wrote a bash script to find out (really just one line):
curl -s https://blog.csdn.net/imred/article/list/{1,2,3} | grep "read-num" | grep "readCountWhite.png" | awk -F'[<>]' '{ print $5 }' | sort | cut -c1 | uniq -c | awk 'BEGIN { TOTAL=0 } { arr[$2]=$1; TOTAL+=$1 } END { for (i in arr) print i, arr[i]/TOTAL*100}' | awk '{ ACTUALS=sprintf("%-*s", $2, ""); gsub(" ", "=", ACTUALS); STD=100*log((NR+1)/NR)/log(10); STDS=sprintf("%-*s", STD, ""); gsub(" ", "+", STDS); printf("%-2s%6.2f%% %s\n%-2s%6.2f%% %s\n", "", STD, STDS, $1, $2, ACTUALS)}'
Its output is shown below (`+` marks the expected value computed from Benford's law, `=` marks the actual measured result):
30.10% ++++++++++++++++++++++++++++++
1 36.61% ====================================
17.61% +++++++++++++++++
2 16.07% ================
12.49% ++++++++++++
3 13.39% =============
9.69% +++++++++
4 9.82% =========
7.92% +++++++
5 6.25% ======
6.69% ++++++
6 5.36% =====
5.80% +++++
7 7.14% =======
5.12% +++++
8 3.57% ===
4.58% ++++
9 1.79% =
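The numbers differ somewhat, with the largest gap at the digit 1 (36.61% observed versus 30.10% expected), but the overall trend is very similar.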
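That was my first version, with the blog address and the total page count hard-coded. To make the script usable for any blogger given a CSDN user id, and for local data as well, I made a "small" change (benford):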
#!/bin/bash
#################################################################################
# MIT License
#
# Copyright (c) 2020 Jia Lihong
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#################################################################################
#
# Usage: $(basename $0) CSDN_ID # check csdn blog view count
# $(basename $0) -r FILE # check local raw data
# Check whether the data meets Benford's law.
#
display_help() {
    echo "Usage: $(basename "$0") CSDN_ID # check csdn blog view count"
    echo "       $(basename "$0") -r FILE # check local raw data"
    echo "Check whether the data meets Benford's law."
}
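benford() {
    local -r RAW_DATA=$*
    # Count how often each leading character occurs: one "COUNT DIGIT" pair per line.
    local -r DIST=$(echo "${RAW_DATA}" | awk '/[1-9]/ {print $1}' | sort | cut -c1 | uniq -c)
    if [[ "${DIST}" == "" ]]; then
        echo "Empty data"
        return
    fi
    # Convert counts to percentages. Loop over 1..9 explicitly: the order of "for (i in arr)" is unspecified in awk.
    local -r PERCENT=$(echo "${DIST}" | awk 'BEGIN { TOTAL=0; for (i=1; i<=9; i++) arr[i]=0; } { arr[$2]=$1; TOTAL+=$1 } END { for (i=1; i<=9; i++) print i, arr[i]/TOTAL*100 }')
    # Draw a "+" bar (expected) and a "=" bar (actual) per digit; the digit is taken from $1 rather than the row number.
    echo "${PERCENT}" | awk '{ ACTUALS=sprintf("%-*s", $2, ""); gsub(" ", "=", ACTUALS); STD=100*log(($1+1)/$1)/log(10); STDS=sprintf("%-*s", STD, ""); gsub(" ", "+", STDS); printf("%-2s%6.2f%% %s\n%-2s%6.2f%% %s\n", "", STD, STDS, $1, $2, ACTUALS)}'
}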
csdn_data() {
    local -r ID=$1
    echo "csdn id: ${ID}" >&2
    local -r URL_PREFIX="https://blog.csdn.net/${ID}/article/list/"
    echo "url prefix: ${URL_PREFIX}" >&2
    local -r HTML="$(curl -s "${URL_PREFIX}")"
    local -r PAGE_SIZE="$(echo "${HTML}" | grep "pageSize" | grep -E -o "[0-9]+")"
    echo "page size: ${PAGE_SIZE}" >&2
    local -r LIST_TOTAL="$(echo "${HTML}" | grep "listTotal" | grep -E -o "[0-9]+")"
    echo "list total: ${LIST_TOTAL}" >&2
    if (( LIST_TOTAL == 0 || PAGE_SIZE == 0 )); then
        echo ""
        return
    fi
    local -r PAGE_COUNT="$(( (LIST_TOTAL - 1) / PAGE_SIZE + 1 ))"
    echo "page count: ${PAGE_COUNT}" >&2
    local -r RAW_DATA=$(curl "${URL_PREFIX}{$(seq -s, 1 ${PAGE_COUNT})}" | grep "read-num" | grep "readCountWhite.png" | awk -F'[<>]' '{ print $5 }')
    echo "${RAW_DATA}"
}
main() {
    DISPLAY_HELP="false"
    RAW_FILE=""
    while getopts "hr:" opt
    do
        case "$opt" in
            h) DISPLAY_HELP="true" ;;
            r) RAW_FILE="${OPTARG}" ;;
            *) echo "Unknown option $opt"; exit 1 ;;
        esac
    done
    if [[ "${DISPLAY_HELP}" == "true" ]]; then
        display_help
        exit 0
    fi
    if [[ "${RAW_FILE}" == "" ]]; then
        shift $(( OPTIND - 1))
        IDS="imred"
        if [ -n "$1" ]; then
            IDS="$*"
        fi
        for CSDN_ID in ${IDS}; do
            echo "===================="
            RAW_DATA=$(csdn_data "${CSDN_ID}")
            benford "${RAW_DATA}"
        done
    else
        if ! [ -e "${RAW_FILE}" ]; then
            echo "File doesn't exist: ${RAW_FILE}"
            exit 1
        fi
        RAW_DATA=$(cat "${RAW_FILE}")
        benford "${RAW_DATA}"
    fi
}
main "$@"
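The paging logic in `csdn_data` is worth a closer look: it rounds the page count up with integer arithmetic only, then hands curl a single URL containing `{1,2,...}`, which curl expands itself (URL globbing) into one request per page. Below is a minimal sketch of just that step. `list_url_glob` is a hypothetical helper, and the 112 posts and 40 posts per page in the example call are made-up numbers, not values read from CSDN.

```bash
#!/bin/bash
# list_url_glob: hypothetical helper that mirrors the paging logic of csdn_data.
# $1 = CSDN id, $2 = total number of posts, $3 = posts per page.
list_url_glob() {
    local -r id=$1 total=$2 page_size=$3
    # Ceiling division without floating point: ceil(total / page_size)
    local -r pages=$(( (total - 1) / page_size + 1 ))
    local list="" i
    for (( i = 1; i <= pages; i++ )); do
        list+="${list:+,}${i}"      # comma-separated page numbers: 1,2,3,...
    done
    # curl expands {1,2,3} by itself, so one curl call fetches every page
    echo "https://blog.csdn.net/${id}/article/list/{${list}}"
}

list_url_glob imred 112 40   # 112 and 40 are made-up numbers for illustration
```

With these example numbers the page count is (112 - 1) / 40 + 1 = 3, so the printed URL ends in `{1,2,3}`, the same shape as the hard-coded URL in the one-line version.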
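I ran this script against several bloggers who publish a lot of solid content (I think they are unlikely to be inflating their view counts) and found that a few of them deviate fairly far from the expected values: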
30.10% ++++++++++++++++++++++++++++++
1 27.66% ===========================
17.61% +++++++++++++++++
2 17.02% =================
12.49% ++++++++++++
3 15.96% ===============
9.69% +++++++++
4 1.06% =
7.92% +++++++
5 9.57% =========
6.69% ++++++
6 14.89% ==============
5.80% +++++
7 3.19% ===
5.12% +++++
8 6.38% ======
4.58% ++++
9 4.26% ====

30.10% ++++++++++++++++++++++++++++++
1 14.49% ==============
17.61% +++++++++++++++++
2 23.19% =======================
12.49% ++++++++++++
3 13.04% =============
9.69% +++++++++
4 13.77% =============
7.92% +++++++
5 12.32% ============
6.69% ++++++
6 7.97% =======
5.80% +++++
7 7.97% =======
5.12% +++++
8 5.07% =====
4.58% ++++
9 2.17% ==
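Still, most of these bloggers do have view counts that match Benford's law, so the data cannot be dismissed as meaningless. The large deviations may come from samples that are too small, or from other factors we have not considered.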
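To put a number on "deviates a lot" instead of eyeballing the bars, one common option is Pearson's chi-square statistic over the nine leading-digit counts. The original script does not do this; `benford_chi2` below is a hypothetical helper that reads one number per line on stdin, the same raw format that `-r FILE` accepts. Treat it as a rough sketch: it assumes the values are independent samples, which view counts on a single blog may well not be.

```bash
#!/bin/bash
# benford_chi2: hypothetical helper, not part of the original script.
# Reads one number per line on stdin and prints Pearson's chi-square statistic
# of the observed leading-digit counts against Benford's law.
benford_chi2() {
    awk '
        # Only lines whose first character is a digit 1-9 are counted.
        /^[1-9]/ { count[substr($1, 1, 1)]++; total++ }
        END {
            if (total == 0) { print "Empty data"; exit 1 }
            chi2 = 0
            for (d = 1; d <= 9; d++) {
                # Expected count for digit d: total * log10((d + 1) / d)
                expected = total * log((d + 1) / d) / log(10)
                diff = count[d] - expected
                chi2 += diff * diff / expected
            }
            printf("%.2f\n", chi2)
        }'
}
```

It can be run as `benford_chi2 < FILE`. With nine digits there are 8 degrees of freedom, so under those assumptions a value above roughly 15.51 would count as a significant deviation at the 5% level. With small samples the statistic is noisy, which fits the sample-size caveat above.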
Finally, here are the rather "outstanding" numbers of two "big V" bloggers I found on the leaderboard:
30.10% ++++++++++++++++++++++++++++++
1 10.49% ==========
17.61% +++++++++++++++++
2 5.99% =====
12.49% ++++++++++++
3 20.22% ====================
9.69% +++++++++
4 18.73% ==================
7.92% +++++++
5 14.61% ==============
6.69% ++++++
6 13.48% =============
5.80% +++++
7 7.12% =======
5.12% +++++
8 5.62% =====
4.58% ++++
9 3.75% ===

30.10% ++++++++++++++++++++++++++++++
1 3.74% ===
17.61% +++++++++++++++++
2 57.89% =========================================================
12.49% ++++++++++++
3 13.98% =============
9.69% +++++++++
4 6.83% ======
7.92% +++++++
5 5.37% =====
6.69% ++++++
6 7.48% =======
5.80% +++++
7 3.09% ===
5.12% +++++
8 1.30% =
4.58% ++++
9 0.33%
This does not necessarily prove anything, but the numbers really are a little too "outstanding".
(PS: remember to wash your hands after using bash, and disinfect with medical alcohol if you can.)