自然语言处理(3)之条件频率分布
条件频率分布式频率分布的集合,每个频率分布有一个不同的条件。
从下面的例子就可以看出,cfd就是两个条件(news,romance)的频率分布集合
1 >>> cfd=nltk.ConditionalFreqDist( 2 ... (genre,word) 3 ... for genre in ['news','romance'] 4 ... for word in brown.words(categories=genre)) 5 >>> cfd 6 <ConditionalFreqDist with 2 conditions> 7 >>> list(cfd['news'])[:4] 8 ['the', ',', '.', 'of'] 9 >>> list(cfd['romance'])[:4] 10 [',', '.', 'the', 'and']
可以用plot()和tabulte()来绘制分布图和分布表
示例:处理布朗此料库的新闻和言情文体,找出一周中最有新闻将至并且是最浪漫的日子.
1 >>> from nltk.corpus import brown 2 >>> days=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Saturday'] 3 >>> cfd=nltk.ConditionalFreqDist( 4 ... (genre,word) 5 ... for genre in ['news','romance'] 6 ... for word in brown.words(categories=genre)) 7 >>> cfd.tabulate(samples=days,cumulative=True) 8 Monday Tuesday Wednesday Thursday Friday Saturday Saturday 9 news 54 97 119 139 180 213 246 10 romance 2 5 8 9 12 16 20 11 >>> cfd.tabulate(samples=days) 12 Monday Tuesday Wednesday Thursday Friday Saturday Saturday 13 news 54 43 22 20 41 33 33 14 romance 2 3 3 1 3 4 4
NLTK条件频率分布的常用方法
示例 | 描述 |
cfdist=ConditionalFreqDist() | 从配对链表中创建条件频率分布 |
cfdist.conditions() | 将条件按字母排序来分类 |
cfdist[condition] | 此条件下的频率分布 |
cfdist[condition][sample] | 此条件下给定样本的频率 |
cfdist.tabulate() | 为条件频率分布制表 |
cfdist.tabulate(samples,conditions) | 在指定样本和条件限制下制表 |
cfdist.plot() | 为条件频率分布绘图 |
cfdist.plot(samples,conditions) | 在指定样本和条件限制下绘图 |