How to interpret Python NLTK bigram likelihood ratios?

I'm trying to figure out how to properly interpret nltk's "likelihood ratio" given the below code (taken from this question).

import nltk.collocations
import nltk.corpus
import collections

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))

# Sort each group by score, highest first.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

prefix_keys['baseball']

With the following output:

[('game', 32.11075451975229),
 ('cap', 27.81891372457088),
 ('park', 23.509042621473505),
 ('games', 23.10503351305401),
 ("player's", 16.22787286342467),
 ('rightfully', 16.22787286342467),
 [...]

Looking at the docs, it looks like the likelihood ratio printed next to each bigram comes from:

"Scores ngrams using likelihood ratios as in Manning and Schutze 5.3.4."

Referring to this article, which states on pg. 22:

One advantage of likelihood ratios is that they have a clear intuitive interpretation. For example, the bigram powerful computers is e^(.5*82.96) = 1.3*10^18 times more likely under the hypothesis that computers is more likely to follow powerful than its base rate of occurrence would suggest. This number is easier to interpret than the scores of the t test or the χ² test which we have to look up in a table.
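Applying that conversion to my own output as a quick sanity check (assuming the e^(.5*score) formula carries over directly to the scores NLTK prints):

import math

# The printed score is -2*log(lambda), so e^(0.5*score) recovers the
# "times more likely" figure described in the quote.
score = 32.11075451975229  # the ('baseball', 'game') score from the output
print(math.exp(0.5 * score))  # roughly 9.4e6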

What I'm confused about is what the "base rate of occurrence" would be when I'm using the nltk code above with my own data. Would it be safe to say, for example, that "game" is 32 times more likely to appear next to "baseball" in the current dataset than in average use of standard English? Or is it that "game" is more likely to appear next to "baseball" than other words are to appear next to "baseball" within the same dataset?

Any help/guidance towards a clearer interpretation or example is much appreciated!

Solution

nltk does not have a universal corpus of English language usage from which to model the probability of 'game' following 'baseball'.

The likelihood scores reflect the likelihood, within that corpus, of each of those words being preceded by the word 'baseball'.

The "base rate of occurrence" is how often the word game occurs in the corpus overall, regardless of what precedes it; the ratio compares that unconditional rate against the rate at which game follows baseball.

nltk.corpus.brown is a built-in corpus, i.e. a fixed set of observations, and the predictive power of any probability-based model is entirely defined by the observations used to construct or train it.
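To make the "base rate" concrete, here is a minimal sketch (assuming Brown's default tokenization) comparing how often 'game' occurs anywhere in the corpus against how often it occurs immediately after 'baseball':

import nltk

words = nltk.corpus.brown.words()
unigram_fd = nltk.FreqDist(words)
bigram_fd = nltk.FreqDist(nltk.bigrams(words))

# Base rate: P('game') regardless of context.
base_rate = unigram_fd['game'] / unigram_fd.N()

# Conditional rate: P('game' | previous word is 'baseball').
conditional_rate = bigram_fd[('baseball', 'game')] / unigram_fd['baseball']

print(base_rate, conditional_rate)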

UPDATE in response to OP comment:

"As in 32% of 'game' occurrences are preceded by 'baseball'?" That reading is slightly misleading; the likelihood score does not directly model a frequency distribution of the bigram.

nltk.collocations.BigramAssocMeasures().raw_freq models raw frequency. Frequency-based statistics such as the t test are not well suited to sparse data like bigram counts, hence the provision of the likelihood ratio.
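For comparison, raw_freq can be scored on the same finder from the question; it ranks bigrams purely by how often they occur, with no correction for the frequencies of the individual words:

# Reusing `bgm` and `finder` from the question's code.
scored_raw = finder.score_ngrams(bgm.raw_freq)

# The top of this list is dominated by common function-word pairs
# (e.g. ('of', 'the')), which is exactly why a plain frequency ranking
# is a poor collocation measure.
print(scored_raw[:5])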

The likelihood ratio as calculated by Manning and Schutze is not equivalent to frequency. Section 5.3.4 describes the calculation in detail: it takes into account the frequency of word one, the frequency of word two, and the frequency of the bigram, in a manner that is well suited to sparse data such as corpus counts.
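For reference, here is a sketch of that calculation as Manning and Schutze present it: the ratio of binomial likelihoods under an independence hypothesis (word two's base rate p = c2/N) versus a dependence hypothesis (p1 = c12/c1 and p2 = (c2 - c12)/(N - c1)). It should approximately reproduce NLTK's scores, up to NLTK's smoothing of zero counts:

import math

def log_likelihood(k, n, x):
    # Log of the binomial likelihood of k successes in n trials with parameter x.
    return k * math.log(x) + (n - k) * math.log(1 - x)

def likelihood_ratio_score(c1, c2, c12, n):
    # c1: count of word one, c2: count of word two,
    # c12: count of the bigram, n: total words in the corpus.
    # Assumes non-degenerate counts (no zero or boundary probabilities).
    p = c2 / n                   # H1: word two's base rate, independent of word one
    p1 = c12 / c1                # H2: P(word two | word one)
    p2 = (c2 - c12) / (n - c1)   # H2: P(word two | not word one)
    log_lambda = (log_likelihood(c12, c1, p)
                  + log_likelihood(c2 - c12, n - c1, p)
                  - log_likelihood(c12, c1, p1)
                  - log_likelihood(c2 - c12, n - c1, p2))
    return -2 * log_lambda       # the score NLTK prints

Fed the Brown counts for ('baseball', 'game') (e.g. from the FreqDist sketch above), this should land near the 32.11 in the output.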

If you are familiar with the TF-IDF vectorization method, this ratio aims for something similar as far as normalizing noisy features. The score is unbounded above; the relative differences between scores reflect the inputs just described (the corpus frequencies of word 1, word 2, and the bigram word1 word2).

The table in Manning and Schutze's section 5.3.4 (shown as an image in the original answer) is the most intuitive piece of their explanation, unless you're a statistician: the likelihood score is calculated as the leftmost column.
