Python: Zipf's Law and Highly Skewed Distributions

the only KIrsTEN

Category: Speech and Text Processing (Python). Tags: python, natural language processing, language models, object detection, artificial intelligence

Article link: https://blog.csdn.net/kirsten111111/article/details/127676967

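Summary: This article explores the application of Zipf's law in linguistics and checks it through simple data analysis and plotting in Python. Zipf's law states that the frequency of a linguistic element (such as a word) is inversely proportional to its rank. The author counts word frequencies in the text of Moby Dick and plots the data on a log-log graph to look for a power-law relationship. The Pylab module is used to show how such graphs are drawn and analysed, illustrating the highly skewed distributions found in language.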

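Here we introduce a well-known empirical law known as Zipf's law. The law implies a highly skewed distribution of linguistic elements (here, words), which matters in many contexts (for example, word-oriented text compression methods). We will try to verify it through some simple data analysis and plotting. Along the way, we will gain some familiarity with the plotting tools available in Python.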

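Zipf's law is an empirical law formulated by the American linguist George Kingsley Zipf (1935). It states that, in a large corpus, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word occurs approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.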
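
The inverse relationship between frequency and rank can be sketched numerically. The snippet below is only an illustration of the idealised law, count(r) = top_count / r; the function name and the top count of 6000 are made up for the example and do not come from any corpus.

```python
def zipf_expected_counts(top_count, n_ranks):
    """Counts predicted by an idealised Zipf law: count(r) = top_count / r."""
    return [top_count / rank for rank in range(1, n_ranks + 1)]

# Hypothetical top count; real corpora only follow this approximately.
expected = zipf_expected_counts(top_count=6000, n_ranks=5)
for rank, count in enumerate(expected, start=1):
    print(rank, round(count))
```

Under this idealisation the second word gets half the count of the first, the third gets one third, and so on.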

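For example, in the Brown Corpus the most common word (the) accounts for nearly 7% of all word occurrences, the second (of) for 3.5%, and so on. As a result, only the 135 top-ranked vocabulary items are needed to account for half of the word occurrences in the Brown Corpus.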
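
To see how such a coverage figure could be computed, here is a minimal sketch that counts how many top-ranked items are needed to reach a given share of all tokens. The function name and the toy frequency list are assumptions for illustration; they are not Brown Corpus data.

```python
def ranks_needed(sorted_freqs, fraction=0.5):
    """Return how many top-ranked items cover `fraction` of all tokens.

    `sorted_freqs` must already be sorted in descending order.
    """
    target = fraction * sum(sorted_freqs)
    running = 0
    for n, freq in enumerate(sorted_freqs, start=1):
        running += freq
        if running >= target:
            return n
    return len(sorted_freqs)

toy_freqs = [40, 20, 10, 10, 5, 5, 5, 5]  # made-up counts, 100 tokens in total
print(ranks_needed(toy_freqs))        # items needed for half of the tokens
print(ranks_needed(toy_freqs, 0.9))   # items needed for 90% of the tokens
```

Applied to a real descending frequency list, the same function gives the number of word types that make up half of the running text.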

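Zipf's law is an example of a power law. Similar power-law observations have been made for many different kinds of data unrelated to language. See the Wikipedia entry on Zipf's law.

As a basis for evaluating Zipf's law, we need word-frequency data from a reasonably large body of text. For this we use the file mobydick.txt, the text of Moby Dick, which begins: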

MOBY DICK; OR THE WHALE 

by Herman Melville




ETYMOLOGY.

(Supplied by a Late Consumptive Usher to a Grammar School)

The pale Usher--threadbare in coat, heart, body, and brain; I see him
now.  He was ever dusting his old lexicons and grammars, with a queer
handkerchief, mockingly embellished with all the gay flags of all the
known nations of the world.  He loved to dust his old grammars; it
somehow mildly reminded him of his mortality.

"While you take in hand to school others, and to teach them by what
name a whale-fish is to be called in our tongue leaving out, through
ignorance, the letter H, which almost alone maketh the signification
of the word, you deliver that which is not true." --HACKLUYT

"WHALE. ... Sw. and Dan. HVAL.  This animal is named from roundness
or rolling; for in Dan. HVALT is arched or vaulted." --WEBSTER'S
DICTIONARY

"WHALE. ... It is more immediately from the Dut. and Ger. WALLEN;
A.S. WALW-IAN, to roll, to wallow." --RICHARDSON'S DICTIONARY

KETOS,               GREEK.
CETUS,               LATIN.
WHOEL,               ANGLO-SAXON.
HVALT,               DANISH.
WAL,                 DUTCH.
HWAL,                SWEDISH.
WHALE,               ICELANDIC.
WHALE,               ENGLISH.
BALEINE,             FRENCH.
BALLENA,             SPANISH.
PEKEE-NUEE-NUEE,     FEGEE.
PEKEE-NUEE-NUEE,     ERROMANGOAN.

The basic part:

"""
USE: python <PROGNAME> (options)
OPTIONS:
    -h :      this help message and exit
    -d FILE : use FILE as data to create a new lexicon file
    -t FILE : apply lexicon to test data in FILE
"""
################################################################

import sys, re, getopt

################################################################
# Command line options handling, and help
opts, args = getopt.getopt(sys.argv[1:], 'hd:t:')

#args = '-a -b -cfoo -d bar a1 a2'.split()
#opts, args = getopt.getopt(args, 'abc:d:')

opts = dict(opts)

def printHelp():
    progname = sys.argv[0]
    progname = progname.split('/')[-1] # strip out extended path
    help = __doc__.replace('<PROGNAME>', progname, 1)
    print('-' * 60, help, '-' * 60, file=sys.stderr)
    sys.exit()
    
if '-h' in opts:
    printHelp()

if '-d' not in opts:
    print("\n** ERROR: must specify training data file (opt: -d FILE) **", file=sys.stderr)
    printHelp()
print(args)
if len(args) > 0:
    print("\n** ERROR: unexpected input on command line **", file=sys.stderr)
    printHelp()

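Write a Python script that counts all the word tokens occurring in the file. Take a simple view of tokenisation: just use a regular expression to extract maximal alphabetic sequences and treat these as the words. Also map the input to lowercase, to conflate case variants. Count the words, and then produce a list of words sorted in descending order of frequency count. Have the script print the total number of word occurrences in the data file, the number of distinct words found, and the top 20 words with their frequencies. Unsurprisingly, these are typical stop words. For this task, however, it makes sense to keep them.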
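
The counting task described above can be sketched compactly with collections.Counter from the standard library. This is an alternative to the plain dictionary used in this article, not the article's own solution; the helper name count_words and the two sample lines are assumptions so that the snippet runs without the data file.

```python
import re
from collections import Counter

WORD_RE = re.compile(r'[a-z]+')  # maximal runs of letters, applied to lowercased text

def count_words(lines):
    """Count lowercased alphabetic tokens over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts

# Sample input standing in for an open file object.
sample = ["The whale, the WHALE!", "He loved to dust his old grammars."]
counts = count_words(sample)

print('TOKENS:', sum(counts.values()))   # total word occurrences
print('TYPES: ', len(counts))            # distinct words
for word, freq in counts.most_common(20):
    print(word, freq)
```

To run it on the real data, pass an open file instead of the sample list, for example count_words(open('mobydick.txt', encoding='utf-8')).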

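Next, we can forget about the actual words in the data and work only with their frequencies, sorted in descending order. First plot these sorted frequencies against their rank position, i.e. the most frequent word has rank 1, and so on. Plotting in Python is covered further on. Try producing this plot for different numbers of words, e.g. the top 100, the top 1000, or the full set.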
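
Preparing the data for such a plot might look like the sketch below, which discards the words and keeps only descending frequencies paired with ranks starting at 1. The function name rank_frequency, the top_n parameter, and the toy counts are all hypothetical.

```python
def rank_frequency(counts, top_n=None):
    """Return (ranks, freqs): freqs sorted descending, ranks starting at 1."""
    freqs = sorted(counts.values(), reverse=True)
    if top_n is not None:
        freqs = freqs[:top_n]  # keep only the top_n most frequent words
    ranks = list(range(1, len(freqs) + 1))
    return ranks, freqs

toy_counts = {'the': 7, 'of': 3, 'whale': 5, 'sea': 1}  # made-up counts
print(rank_frequency(toy_counts))
print(rank_frequency(toy_counts, top_n=2))
```

The two returned lists have the same length, which is what the plotting function expects for its x and y arguments.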

with open('mobydick.txt', "r", encoding="utf-8") as in_file:
    wordRE = re.compile(r'[A-Za-z]+')
    counts = {}
    
    for line in in_file:
        for word in wordRE.findall(line.lower()):
        
            if word not in counts:
                counts[word] = 1
            else:
                counts[word] += 1
    #print(counts)
    tags = sorted(counts, key=lambda x:counts[word], reverse=True)
    for tag in tags[:10]:
        print(tag)

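There is a problem, though: this prints the first 10 distinct words encountered in the text file rather than the 10 most frequent ones. In other words, the sort is wrong. The key function lambda x:counts[word] ignores its argument x and always looks up word, the leftover loop variable, so every word gets the same sort key and the stable sort leaves the original order unchanged. The fix is to look up x instead: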

    tags = sorted(counts, key=lambda x: counts[x], reverse=True)
    for tag in tags[:10]:
        print(tag)

With the corrected tags code above, the 10 most frequent words are printed.
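
The difference between the two sort calls can be reproduced on a tiny dictionary. The counts below are made up, and the variable word is set by hand to stand in for the loop variable left over after counting.

```python
counts = {'call': 1, 'me': 1, 'the': 3, 'whale': 2}  # made-up counts, in insertion order

word = 'whale'  # stands in for the leftover loop variable after counting

# Buggy: the key ignores x, so every word gets the same key (counts['whale']).
buggy = sorted(counts, key=lambda x: counts[word], reverse=True)

# Fixed: each word is keyed by its own count.
fixed = sorted(counts, key=lambda x: counts[x], reverse=True)

print(buggy)  # equal keys, so the stable sort keeps insertion order
print(fixed)  # most frequent first
```

Because Python's sort is stable even with reverse=True, the buggy version returns the words in the order they were first seen, which is exactly the symptom described above.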

Before running the code, set the run configuration (command-line arguments):

-d mobydick.txt

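As discussed on the Wikipedia page on Zipf's law, a power-law relationship is most easily observed by plotting the data on a log-log graph, with axes log(rank) and log(frequency). The data conform to Zipf's law to the extent that the plot is linear. Plot a graph of this relationship.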
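
Besides judging linearity by eye, one could estimate the slope of the log-log line with an ordinary least-squares fit. The sketch below is an addition for illustration, not part of the original exercise; the function name and the synthetic data are assumptions. For data that follow the idealised law exactly, the slope comes out as -1.

```python
import math

def loglog_slope(freqs):
    """Least-squares slope of log(freq) against log(rank), ranks starting at 1.

    Expects at least two positive frequencies, sorted in descending order.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

ideal = [1000 / rank for rank in range(1, 101)]  # synthetic, exactly Zipfian
print(round(loglog_slope(ideal), 3))
```

On real word counts the estimate will deviate from -1, particularly at the very top and in the long tail of rare words.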
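Pylab is an existing Python code library, known as a module, that provides many useful functions, including graph plotting. The basic plot function takes two arguments: a list of the x coordinates of the points to plot, and a list of the corresponding y coordinates (which must of course be the same length as the first list). To plot a graph with (x, y) points (0, 1.2), (1, 2.2) and (2, 1.8), for example, we form lists of x values [0, 1, 2] and y values [1.2, 2.2, 1.8]. We can then plot this graph with the following code: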

import pylab as p
X = [0, 1, 2]
Y = [1.2, 2.2, 1.8]
p.plot(X,Y)
p.show()

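The plot command takes lists of x and y coordinate values, as described above. The show command causes the graph to actually be drawn and displayed. Try this code for yourself. If you are using the IPython console in Spyder, the graph appears in the console window. If you run the code in a terminal, however, the graph appears in a separate window (you may need to click at the bottom of the screen to bring it to the front).

By default, the data are drawn as a continuous line in a colour taken from Matplotlib's default colour cycle. This is fine for the Zipf plotting task, but in other cases you may want different formatting. For this, plot takes an optional third argument, a string that controls the plot format. For example, in plot(Xs, Ys, 'ro-') the format string 'ro-' gives a red ('r') continuous line ('-') with circles ('o') marking the data points, but we could instead choose blue ('b') or green ('g'), stars ('*') or crosses ('x'), or have the line be dashed ('--'), dash-dotted ('-.') or absent (i.e. showing only the data points). Other functions let us label the x and y axes (xlabel, ylabel, e.g. xlabel('time')), give the graph a title (title), or save the figure to a file (savefig), such as a PNG image, for later use.

By calling plot several times before calling show, we can draw multiple lines on the same graph. Calling figure between plot calls starts a new figure, so several separate figures are displayed when show is called. We can also use the subplot function to arrange multiple plots within one figure. For example: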

import pylab as p
X = [0, 1, 2]
Y1 = [1.2, 2.2, 1.8]
Y2 = [1.5, 2.0, 2.6]
p.plot(X,Y1)
p.figure()
p.plot(X,Y2)
p.show()
################
import pylab as p
X = [0, 1, 2]
Y1 = [1.2, 2.2, 1.8]
Y2 = [1.5, 2.0, 2.6]
p.subplot(211)
p.plot(X,Y1)
p.subplot(212)
p.plot(X,Y2)
p.show()
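
The format string and the labelling functions described earlier can be combined as in the following sketch. It assumes Matplotlib is installed; the non-interactive Agg backend is selected so that the script also runs without a display, and the output file name format_demo.png is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display needed, figures are only saved to file
import pylab as p

X = [0, 1, 2]
Y1 = [1.2, 2.2, 1.8]
Y2 = [1.5, 2.0, 2.6]

p.figure()
p.plot(X, Y1, 'ro-')   # red, circle markers, continuous line
p.plot(X, Y2, 'bx--')  # blue, cross markers, dashed line
p.xlabel('time')
p.ylabel('value')
p.title('two formatted lines')
p.savefig('format_demo.png')  # hypothetical output file name
```

Both lines end up on the same axes because plot is called twice without an intervening call to figure. With an interactive backend, p.show() would display the figure as well.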

Thus, the solution is:

"""
USE: python <PROGNAME> (options) datafile1 ... datafileN
OPTIONS:
    -h : print this help message and exit
"""
################################################################

import sys, re, getopt
import pylab as p

opts, args = getopt.getopt(sys.argv[1:], 'h')
opts = dict(opts)
filenames = args

if '-h' in opts:
    progname = sys.argv[0]
    progname = progname.split('/')[-1] # strip out extended path
    help = __doc__.replace('<PROGNAME>', progname, 1)
    print('-' * 60, help, '-' * 60, file = sys.stderr)
    sys.exit()

################################################################
# Count words in data file(s)

wordRE = re.compile(r'\w+')  # raw string, so \w is not treated as a string escape
wdcounts = {}

for filename in filenames:
    with open(filename) as infs:
        for line in infs:
            for wd in wordRE.findall(line.lower()):
                if wd not in wdcounts:
                    wdcounts[wd] = 0
                wdcounts[wd] += 1

################################################################
# Sort words / print top N

words = sorted(wdcounts, reverse = True, key = lambda v:wdcounts[v])
# words = words[:2000] # Truncate (freq sorted) word list
freqs = [wdcounts[w] for w in words]

print()
print('TYPES: ', len(words))
print('TOKENS:', sum(freqs))
print()

topN = 200
for wd in words[:topN]:
    print(wd, ':', wdcounts[wd])

################################################################
# Plot freq vs. rank

ranks = range(1, len(freqs)+1)

p.figure()
p.plot(ranks, freqs)
p.title('freq vs rank')

################################################################
# Plot cumulative freq vs. rank

cumulative = list(freqs) # makes copy of freqs list

for i in range(len(cumulative) - 1):
    cumulative[i + 1] += cumulative[i]

p.figure()
p.plot(ranks, cumulative)
p.title('cumulative freq vs rank')

################################################################
# Plot log-freq vs. log-rank

logfreqs = [p.log(freq) for freq in freqs]
logranks = [p.log(rank) for rank in ranks]

p.figure()
p.plot(logranks, logfreqs)
p.title('log-freq vs log-rank')
p.savefig('log1.png')

################################################################
# Display figures

p.show()

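pylab is a module that ships with Matplotlib and provides a MATLAB-like syntax; it can also be enabled when starting IPython.

The range() function generates a sequence of integers. It can be called in three ways:
First: range(stop), with a single argument. For example, range(10) starts at 0 by default, uses a step of 1, and does not include 10.
Second: range(start, stop), with two arguments. The sequence starts at start and stops before stop.
Third: range(start, stop, step), which produces the integers in [start, stop) with a step of step.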
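
The three call forms can be tried directly. The last two lines show the form used in the solution script to build ranks starting at 1; the frequency list there is made up.

```python
print(list(range(10)))        # stop only: starts at 0, step 1, excludes 10
print(list(range(2, 6)))      # start and stop: 2 up to, but not including, 6
print(list(range(1, 10, 3)))  # start, stop and step

freqs = [7, 5, 3]                        # made-up frequencies
ranks = list(range(1, len(freqs) + 1))   # ranks 1..len(freqs), as in the solution
print(ranks)
```

Wrapping range in list is only needed to print the values; plotting functions and for loops accept the range object as it is.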

Before running the code, set the run configuration (command-line arguments):

mobydick.txt
