How to interpret Python NLTK bigram likelihood ratios?

I'm trying to figure out how to properly interpret nltk's "likelihood ratio" given the below code (taken from this question).

import nltk.collocations
import nltk.corpus
import collections

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))

# Sort each group by score, highest first.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

prefix_keys['baseball']

With the following output:

[('game', 32.11075451975229),
 ('cap', 27.81891372457088),
 ('park', 23.509042621473505),
 ('games', 23.10503351305401),
 ("player's", 16.22787286342467),
 ('rightfully', 16.22787286342467),
 [...]

Looking at the docs, it looks like the likelihood ratio printed next to each bigram comes from:

"Scores ngrams using likelihood ratios as in Manning and Schutze 5.3.4."

Referring to this article, which states on pg. 22:

One advantage of likelihood ratios is that they have a clear intuitive interpretation. For example, the bigram powerful computers is e^(.5*82.96) = 1.3*10^18 times more likely under the hypothesis that computers is more likely to follow powerful than its base rate of occurrence would suggest. This number is easier to interpret than the scores of the t test or the χ² test which we have to look up in a table.
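Applying that conversion to my own output as a quick sanity check (assuming the e^(.5*score) formula carries over directly to the scores NLTK prints):

import math

# The printed score is -2*log(lambda), so e^(0.5*score) recovers the
# "times more likely" figure described in the quote.
score = 32.11075451975229  # the ('baseball', 'game') score from the output
print(math.exp(0.5 * score))  # roughly 9.4e6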

What I'm confused about is what the "base rate of occurrence" would be when I'm using the nltk code above with my own data. Would it be safe to say, for example, that "game" is 32 times more likely to appear next to "baseball" in the current dataset than in average use of standard English? Or is it that "game" is more likely to appear next to "baseball" than other words are to appear next to "baseball" within the same dataset?

Any help/guidance towards a clearer interpretation or example is much appreciated!

Solution

nltk does not have a universal corpus of English language usage from which to model the probability of 'game' following 'baseball'.

The likelihood scores reflect the likelihood, within that corpus, of each of those words being preceded by the word 'baseball'.

The "base rate of occurrence" is how often the word game occurs in the corpus overall, regardless of what precedes it; the ratio compares that unconditional rate against the rate at which game follows baseball.

nltk.corpus.brown is a built-in corpus, i.e. a fixed set of observations, and the predictive power of any probability-based model is entirely defined by the observations used to construct or train it.
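To make the "base rate" concrete, here is a minimal sketch (assuming Brown's default tokenization) comparing how often 'game' occurs anywhere in the corpus against how often it occurs immediately after 'baseball':

import nltk

words = nltk.corpus.brown.words()
unigram_fd = nltk.FreqDist(words)
bigram_fd = nltk.FreqDist(nltk.bigrams(words))

# Base rate: P('game') regardless of context.
base_rate = unigram_fd['game'] / unigram_fd.N()

# Conditional rate: P('game' | previous word is 'baseball').
conditional_rate = bigram_fd[('baseball', 'game')] / unigram_fd['baseball']

print(base_rate, conditional_rate)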

UPDATE in response to OP comment:

"As in 32% of 'game' occurrences are preceded by 'baseball'?" That reading is slightly misleading; the likelihood score does not directly model a frequency distribution of the bigram.

nltk.collocations.BigramAssocMeasures().raw_freq models raw frequency. Frequency-based statistics such as the t test are not well suited to sparse data like bigram counts, hence the provision of the likelihood ratio.
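For comparison, raw_freq can be scored on the same finder from the question; it ranks bigrams purely by how often they occur, with no correction for the frequencies of the individual words:

# Reusing `bgm` and `finder` from the question's code.
scored_raw = finder.score_ngrams(bgm.raw_freq)

# The top of this list is dominated by common function-word pairs
# (e.g. ('of', 'the')), which is exactly why a plain frequency ranking
# is a poor collocation measure.
print(scored_raw[:5])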

The likelihood ratio as calculated by Manning and Schutze is not equivalent to frequency. Section 5.3.4 describes the calculation in detail: it takes into account the frequency of word one, the frequency of word two, and the frequency of the bigram, in a manner that is well suited to sparse data such as corpus counts.
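For reference, here is a sketch of that calculation as Manning and Schutze present it: the ratio of binomial likelihoods under an independence hypothesis (word two's base rate p = c2/N) versus a dependence hypothesis (p1 = c12/c1 and p2 = (c2 - c12)/(N - c1)). It should approximately reproduce NLTK's scores, up to NLTK's smoothing of zero counts:

import math

def log_likelihood(k, n, x):
    # Log of the binomial likelihood of k successes in n trials with parameter x.
    return k * math.log(x) + (n - k) * math.log(1 - x)

def likelihood_ratio_score(c1, c2, c12, n):
    # c1: count of word one, c2: count of word two,
    # c12: count of the bigram, n: total words in the corpus.
    # Assumes non-degenerate counts (no zero or boundary probabilities).
    p = c2 / n                   # H1: word two's base rate, independent of word one
    p1 = c12 / c1                # H2: P(word two | word one)
    p2 = (c2 - c12) / (n - c1)   # H2: P(word two | not word one)
    log_lambda = (log_likelihood(c12, c1, p)
                  + log_likelihood(c2 - c12, n - c1, p)
                  - log_likelihood(c12, c1, p1)
                  - log_likelihood(c2 - c12, n - c1, p2))
    return -2 * log_lambda       # the score NLTK prints

Fed the Brown counts for ('baseball', 'game') (e.g. from the FreqDist sketch above), this should land near the 32.11 in the output.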

If you are familiar with the TF-IDF vectorization method, this ratio aims for something similar as far as normalizing noisy features. The score is unbounded above; the relative differences between scores reflect the inputs just described (the corpus frequencies of word 1, word 2, and the bigram word1 word2).

The table in Manning and Schutze's section 5.3.4 (shown as an image in the original answer) is the most intuitive piece of their explanation, unless you're a statistician: the likelihood score is calculated as the leftmost column.
