Zipf's Law

彬彬侠

Published 2025-01-30 20:42:30

Article link: https://blog.csdn.net/u013172930/article/details/145400647

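1. Definition

Zipf's Law is an empirical rule that describes how word frequencies are distributed in natural language. It states that, within a text or a corpus, the frequency $f$ of a word and its frequency rank $r$ are related by: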

$f \propto \frac{1}{r^s}$

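where:

$f$ is the frequency of the word.
$r$ is the rank of the word (sorted by frequency, from highest to lowest).
$s$ is a constant that is typically close to 1 in natural language (i.e. $s \approx 1$).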

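In other words, with $s \approx 1$, the most frequent word in most languages occurs about twice as often as the second most frequent word and about three times as often as the third, and so on. In general, the word at rank $r$ occurs roughly $1/r$ as often as the top-ranked word.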
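To make these ratios concrete, here is a minimal sketch that evaluates the idealized law $f = C / r^s$ for the first few ranks. The function name `zipf_frequency` and the parameter values are assumptions chosen for illustration; they do not come from any library.

```python
def zipf_frequency(rank, s=1.0, c=1.0):
    """Idealized Zipf frequency f = c / rank**s (c and s are illustrative)."""
    return c / rank ** s


# Compare the top-ranked word with the words at ranks 1..5 for two exponents.
for s in (1.0, 1.2):
    top = zipf_frequency(1, s)
    ratios = [top / zipf_frequency(rank, s) for rank in range(1, 6)]
    print(f"s={s}: " + ", ".join(f"{ratio:.2f}x" for ratio in ratios))
```

With $s = 1$ the ratio at rank $r$ is simply $r$; a larger exponent makes the frequencies fall off faster.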

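2. Mathematical Form of Zipf's Law

Writing the law as $f = C / r^s$, where $C$ is a constant, and taking the logarithm of both sides gives: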

$\log f = \log C - s \log r$

On a log-log plot, the relationship between word frequency $f$ and rank $r$ should therefore be approximately a straight line with slope $-s$.
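The straight-line claim can be checked numerically. The sketch below evaluates an exact power law at a few ranks and computes the slope between consecutive points in log-log space. The values of `s` and `c` and the chosen ranks are arbitrary assumptions.

```python
import math

s = 1.0  # assumed Zipf exponent
c = 0.1  # assumed constant C

ranks = [1, 2, 5, 10, 100, 1000]
# Points (log r, log f) for an exact power law f = c / r**s.
points = [(math.log(r), math.log(c / r ** s)) for r in ranks]

# Slope between each pair of consecutive points in log-log space.
slopes = [
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in zip(points, points[1:])
]
print(slopes)
```

Every pairwise slope equals $-s$ up to floating-point error, which is what "a straight line with slope $-s$" means. Real word counts only follow this approximately.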

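3. An Example of Zipf's Law

Suppose that in an English text the most common word is "the", with a frequency of 10%. Then, taking $s = 1$:

The second most common word might be "of", with a frequency of about 5%.
The third most common word might be "and", with a frequency of about 3.3%.
And so on: word frequency decays as a power law as the rank increases.

Example frequency ranking (English text):

Rank $r$	Word	Frequency $f$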
1	the	10.0%
2	of	5.0%
3	and	3.3%
4	to	2.5%
5	a	2.0%
…	…	…
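The rows of this table follow directly from dividing the top word's share by the rank. A minimal sketch, assuming $s = 1$ and the 10% share used in the example (the variable names are illustrative):

```python
top_share = 0.10  # share of the top-ranked word, as in the example
s = 1.0           # assumed Zipf exponent

words = ["the", "of", "and", "to", "a"]
shares = [top_share / rank ** s for rank in range(1, len(words) + 1)]

# Print the same rank / word / frequency columns as the table.
for rank, (word, share) in enumerate(zip(words, shares), start=1):
    print(f"{rank}\t{word}\t{share:.1%}")

# Combined share of the listed words.
print(f"top {len(words)} words together: {sum(shares):.1%}")
```

Under these assumptions the five listed words alone account for a bit under a quarter of all word occurrences.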

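4. Applications of Zipf's Law

Zipf's Law is widely applied in:

Natural language processing (NLP)
- Word frequency analysis, which informs text compression, information retrieval, and search engine optimization (SEO).
- Word embedding training, where the skewed frequency distribution motivates frequency-aware sampling: Word2Vec, for example, subsamples very frequent words and draws negative samples from a smoothed unigram distribution (frequencies raised to the 3/4 power).
Information retrieval and search engines
- High-frequency words (such as "the" and "is") carry little information, while low-frequency words are more discriminative, so retrieval systems down-weight high-frequency words (for example with TF-IDF).
Text compression
- Because the word distribution in text follows Zipf's Law, methods such as Huffman coding can assign shorter codes to frequent words and store text more compactly.
Sociology and economics
- Zipf's Law is also often used to describe power-law behavior in areas such as city sizes, company revenues, and website traffic.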
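As an illustration of the text compression point, the sketch below builds a word-level Huffman code for a hypothetical eight-word vocabulary with ideal $1/r$ weights and reports the code length per word. The helper `huffman_code_lengths` and the word names `w1` to `w8` are made up for this example; it is not a description of how any particular compressor works.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {sym: 0}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol inside gets one bit deeper.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical vocabulary: word at rank r has weight 1/r (ideal Zipf, s = 1).
freqs = {f"w{r}": 1.0 / r for r in range(1, 9)}
lengths = huffman_code_lengths(freqs)

for word, bits in sorted(lengths.items(), key=lambda item: item[1]):
    print(f"{word}: {bits} bits")
```

The frequent words end up with shorter codes than the rare ones, which is where the storage saving comes from.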

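5. Python Implementation

We can use Python to count the word frequencies in a text and plot them against rank to check how closely they follow Zipf's Law.

(1) Count word frequencies and sort them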

import re
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np

# Sample text
text = """
Zipf’s law states that the frequency of a word is inversely proportional to its rank.
The most common words appear very frequently, while rare words appear infrequently.
This pattern holds in many natural languages.
"""

# Preprocess the text: convert to lowercase and strip punctuation
text = text.lower()
text = re.sub(r'[^\w\s]', '', text)

# Count word frequencies
words = text.split()
word_counts = Counter(words)

# Sort by frequency, highest first
sorted_word_counts = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)

# Print the 10 most frequent words
print("Top 10 frequent words:")
for i, (word, freq) in enumerate(sorted_word_counts[:10]):
    print(f"{i+1}. {word}: {freq}")

(2) Plot the Zipf's Law curve

# Extract ranks and frequencies
ranks = np.arange(1, len(sorted_word_counts) + 1)  # frequency ranks
frequencies = [freq for _, freq in sorted_word_counts]

# Plot the observed frequency distribution on log-log axes
plt.figure(figsize=(8, 5))
plt.loglog(ranks, frequencies, marker="o", linestyle="none", color="blue", label="Observed")

# Fit a straight line in log-log space
# (the sample text is tiny, so this fit only illustrates the procedure)
slope, intercept = np.polyfit(np.log(ranks), np.log(frequencies), 1)
plt.plot(ranks, np.exp(intercept) * ranks ** slope, color="red", linestyle="dashed", label=f"Fit: slope={slope:.2f}")

plt.xlabel("Rank (log scale)")
plt.ylabel("Frequency (log scale)")
plt.title("Zipf's Law in Word Frequency")
plt.legend()
plt.show()
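The three-sentence sample above is too short to show a clear power law, since most of its words occur only once. To sanity-check the fitting step, one option is to draw a synthetic "corpus" from a known Zipf distribution and see whether the fitted slope recovers the exponent. The sketch below uses only the standard library. The vocabulary size, corpus length, seed, and the choice to fit only the top 50 ranks are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

vocab_size = 1000     # assumed vocabulary size
num_tokens = 200_000  # assumed corpus length
s_true = 1.0          # assumed Zipf exponent used to generate the data

# Word ids 1..vocab_size, where id r is drawn with probability proportional to 1 / r**s_true.
word_ids = list(range(1, vocab_size + 1))
weights = [1.0 / r ** s_true for r in word_ids]
tokens = random.choices(word_ids, weights=weights, k=num_tokens)

# Observed counts, sorted from most to least frequent.
counts = sorted(Counter(tokens).values(), reverse=True)

# Least-squares line through (log rank, log count) for the top ranks only,
# because counts in the rare-word tail are small and noisy.
top_n = 50
xs = [math.log(r) for r in range(1, top_n + 1)]
ys = [math.log(c) for c in counts[:top_n]]
mean_x = sum(xs) / top_n
mean_y = sum(ys) / top_n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
print(f"fitted slope over top {top_n} ranks: {slope:.2f}")
```

The fitted slope should come out close to `-s_true`. On real text the fit depends on the corpus and on which ranks are included, so the result is best read as a rough estimate of $s$.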