Python Natural Language Processing: Statistical Language Modeling (1/2)

1. Counting Word Frequencies

Example: generating 1-gram, 2-gram, and 4-gram token samples from the Alpino corpus


import nltk  # 1-gram
from nltk.util import ngrams
from nltk.corpus import alpino

print(alpino.words())                 # preview of the Alpino corpus tokens
unigrams = ngrams(alpino.words(), 1)  # generator of 1-gram tuples
for i in unigrams:
    print(i)


import nltk  # 2-gram
from nltk.util import ngrams
from nltk.corpus import alpino

print(alpino.words())
bigrams_tokens = ngrams(alpino.words(), 2)  # generator of 2-gram tuples
for i in bigrams_tokens:
    print(i)


import nltk  # 4-gram
from nltk.util import ngrams
from nltk.corpus import alpino

print(alpino.words())
quadgrams = ngrams(alpino.words(), 4)  # generator of 4-gram tuples
for i in quadgrams:
    print(i)

Generating the 2-grams of a piece of text together with their frequency counts, and likewise the 4-grams with their frequency counts


import nltk  # 2-grams
from nltk.collocations import BigramCollocationFinder

text = "Hello how are you doing ? I hope you find the book interesting"
tokens = nltk.wordpunct_tokenize(text)
twograms = BigramCollocationFinder.from_words(tokens)  # builds a bigram frequency distribution
for twogram, freq in twograms.ngram_fd.items():        # ngram_fd holds the bigram counts
    print(twogram, freq)


import nltk  # 4-grams
from nltk.collocations import QuadgramCollocationFinder

text = "Hello how are you doing ? I hope you find the book interesting"
tokens = nltk.wordpunct_tokenize(text)
fourgrams = QuadgramCollocationFinder.from_words(tokens)  # builds a 4-gram frequency distribution
for fourgram, freq in fourgrams.ngram_fd.items():
    print(fourgram, freq)
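
The same counts can also be obtained without the collocation finders by feeding the n-grams into FreqDist directly. A minimal sketch of that alternative (the variable names bigram_fd and fourgram_fd are my own):

import nltk
from nltk.util import ngrams
from nltk.probability import FreqDist

text = "Hello how are you doing ? I hope you find the book interesting"
tokens = nltk.wordpunct_tokenize(text)

bigram_fd = FreqDist(ngrams(tokens, 2))    # count each 2-gram of the text
fourgram_fd = FreqDist(ngrams(tokens, 4))  # count each 4-gram of the text

for gram, freq in bigram_fd.items():
    print(gram, freq)
for gram, freq in fourgram_fd.items():
    print(gram, freq)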

2. Frequencies in NLTK

import nltk
from nltk.probability import FreqDist

text = "How tragic that most people had to get ill before they understood what a gift it was to be alive"
ftext = nltk.word_tokenize(text)
fdist = FreqDist(ftext)

print(fdist.N())          # total number of samples (tokens)
print(fdist.max())        # the sample with the highest count
print(fdist.freq("How"))  # relative frequency of "How"

for i in fdist:
    print(i, fdist.freq(i))  # relative frequency of every sample

words = fdist.keys()
print(words)              # the samples (keys) of the distribution

fdist.tabulate()          # print the frequency distribution as a table
fdist.plot()              # draw the frequency distribution plot

The relationship between frequency and probability

Under suitable experimental conditions, frequency can stand in for probability. For example, if you toss a coin 10,000 times and it lands heads up 5,023 times, the probability of heads can be taken as roughly 5023/10000. To obtain the distribution behind these frequencies (the probability distribution), we usually turn to estimation, as sketched below.
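
A minimal sketch of that idea, reusing the coin-toss numbers from the text (the variable names are my own):

# Coin-toss example: 10,000 tosses, 5,023 of them heads
tosses = 10000
heads = 5023

# The relative frequency serves as an estimate of the probability of heads
relative_frequency = heads / tosses
print(relative_frequency)  # 0.5023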

3. Probability Distributions in NLTK (implemented in NLTK's probability.py, which is well worth reading)

Knowing the frequencies gives us a rough idea of the probabilities. In probability theory you will have seen estimation, where samples are used to estimate quantities such as the variance and the expectation; here we use frequencies to estimate the probability distribution.

import nltk  # maximum likelihood estimation (MLE)
from nltk.probability import FreqDist, MLEProbDist

text = "How tragic that most people had to get ill before they understood what a gift it was to be alive"
ftext = nltk.word_tokenize(text)
fdist = FreqDist(ftext)

mle = MLEProbDist(fdist)  # build the estimator once and reuse it
print(mle.max())          # the sample with the highest probability
print(mle.samples())      # all samples of the distribution
for i in mle.freqdist():
    print(i, mle.prob(i))  # MLE probability of each sample

import nltk  # Lidstone estimation
from nltk.probability import FreqDist, LidstoneProbDist

text = "How tragic that most people had to get ill before they understood what a gift it was to be alive"
ftext = nltk.word_tokenize(text)
fdist = FreqDist(ftext)

lidstone = LidstoneProbDist(fdist, 0.5)  # gamma = 0.5 is added to every sample's count
print(lidstone.max())
print(lidstone.samples())
for i in lidstone.freqdist():
    print(i, lidstone.prob(i))

Other estimators are also available; see the documentation of the probability module.
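
For instance, nltk.probability also provides LaplaceProbDist (Lidstone estimation with gamma = 1) and ELEProbDist (expected likelihood estimation, i.e. Lidstone with gamma = 0.5). A minimal sketch reusing the same frequency distribution as above:

import nltk
from nltk.probability import FreqDist, LaplaceProbDist, ELEProbDist

text = "How tragic that most people had to get ill before they understood what a gift it was to be alive"
fdist = FreqDist(nltk.word_tokenize(text))

laplace = LaplaceProbDist(fdist)  # Lidstone estimation with gamma = 1
ele = ELEProbDist(fdist)          # expected likelihood estimation, gamma = 0.5
for sample in fdist:
    print(sample, laplace.prob(sample), ele.prob(sample))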
