Query Language Model

zkq_1986

Category: NLP

Article link: https://blog.csdn.net/zkq_1986/article/details/77269759

1 TFIDF

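Within a given document, term frequency (TF) is how often a given term occurs in that document. It is the raw term count normalized by document length, so that the measure is not biased toward long documents. (The same term is likely to have a higher count in a long document than in a short one, regardless of how important it is.) For a term $t_{i}$ in a particular document, its importance can be expressed as: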

$\mathrm{tf_{i,j}} = \frac{n_{i,j}}{\sum_k n_{k,j}}$

Here $n_{i,j}$ is the number of times the term $t_{i}$ occurs in document $d_{j}$, and the denominator is the total number of occurrences of all terms in $d_{j}$.

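Inverse document frequency (IDF) measures how informative a term is across the whole corpus. The IDF of a term is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of the quotient: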

$\mathrm{idf_{i}} = \log \frac{|D|}{|\{j: t_{i} \in d_{j}\}|}$

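where

$|D|$: the total number of documents in the corpus
$|\{ j: t_{i} \in d_{j}\}|$: the number of documents containing the term $t_{i}$ (that is, documents with $n_{i,j} \neq 0$). If the term does not occur in the corpus, this divisor is zero, so in practice $1 + |\{j : t_{i} \in d_{j}\}|$ is commonly used as the denominator instead.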

Then

$\mathrm{tfidf_{i,j}} = \mathrm{tf_{i,j}} \times \mathrm{idf_{i}}$
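The three formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are already tokenized into lists of words, the function names (`tf`, `idf`, `tfidf`) and the toy corpus are made up for this example, and the `smooth` flag switches to the $1 + |\{j : t_{i} \in d_{j}\}|$ denominator.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: count of `term` divided by the total number of tokens."""
    counts = Counter(doc_tokens)
    return counts[term] / len(doc_tokens)

def idf(term, corpus, smooth=False):
    """Inverse document frequency: log(|D| / number of documents containing term).

    With smooth=True the denominator becomes 1 + document frequency, which
    avoids division by zero for terms absent from the corpus.
    """
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + df if smooth else df))

def tfidf(term, doc_tokens, corpus, smooth=False):
    """tf-idf weight of `term` in one document of `corpus`."""
    return tf(term, doc_tokens) * idf(term, corpus, smooth)

# Toy corpus (hypothetical data), already tokenized.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["cat", "and", "dog"],
]
print(tfidf("cat", corpus[0], corpus))  # tf = 1/6, idf = log(3/2)
print(tfidf("the", corpus[0], corpus))  # tf = 2/6, idf = log(3/2)
```

Without `smooth=True`, asking for the IDF of a term that appears in no document raises `ZeroDivisionError`, which is exactly the case the adjusted denominator is meant to handle.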

2 BM25

BM25 takes into account term frequency (tf), query term frequency (qtf), and document length.

Given a query $Q$ , containing keywords $q_1, ..., q_n$ , the BM25 score of a document $D$ is:

$\text{score}(D,Q) = \sum_{i=1}^{n} \text{IDF}(q_{i}) \cdot \frac{f(q_{i},D) \cdot (k_{1}+1)}{f(q_{i},D) + k_{1} \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$

where $f(q_i, D)$ is the term frequency of $q_{i}$ in the document $D$, $|D|$ is the length of the document $D$ in words, and $\text{avgdl}$ is the average document length in the text collection from which documents are drawn. $k_{1}$ and $b$ are free parameters, usually chosen, in the absence of an advanced optimization, as $k_1 \in [1.2, 2.0]$ and $b = 0.75$ [1]. $\text{IDF}(q_i)$ is the IDF (inverse document frequency) weight of the query term $q_{i}$. It is usually computed as:

$\text{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$

where $N$ is the total number of documents in the collection, and $n(q_i)$ is the number of documents containing $q_{i}$ .
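As a rough sketch of how the score and IDF formulas fit together, the following Python computes BM25 for a tokenized query against a tokenized document. The function names (`bm25_idf`, `bm25_score`), the defaults `k1=1.2` and `b=0.75`, and the toy corpus are assumptions made for this example, not part of any particular library.

```python
import math
from collections import Counter

def bm25_idf(term, corpus):
    """IDF(q_i) = log((N - n(q_i) + 0.5) / (n(q_i) + 0.5))."""
    n_docs = len(corpus)
    n_t = sum(1 for doc in corpus if term in doc)
    return math.log((n_docs - n_t + 0.5) / (n_t + 0.5))

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """Sum of per-term BM25 contributions for one document."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    freqs = Counter(doc_tokens)
    # Length normalization: documents longer than average are penalized.
    length_norm = 1 - b + b * len(doc_tokens) / avgdl
    score = 0.0
    for q in query_tokens:
        f = freqs[q]
        score += bm25_idf(q, corpus) * f * (k1 + 1) / (f + k1 * length_norm)
    return score

# Toy corpus (hypothetical data), already tokenized.
corpus = [
    ["information", "retrieval", "with", "language", "models"],
    ["probabilistic", "ranking", "function"],
    ["language", "of", "flowers"],
    ["cooking", "with", "fire"],
    ["document", "length", "normalization"],
]
query = ["language", "retrieval"]
for doc in corpus:
    print(round(bm25_score(query, doc, corpus), 4), doc)
```

Note that with this IDF form, a term occurring in more than half of the documents gets a negative weight; the toy corpus is chosen so that this does not happen for the query terms.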

3 Query likelihood

Rank documents by the probability that the query could be generated by the document model (i.e., that the query and the document are on the same topic).

Given a query $Q$, start with $P(D|Q)$.

Using Bayes' rule:

$P(D|Q) \propto P(Q|D)\,P(D)$

Assuming the prior $P(D)$ is uniform and using a unigram model:

$P(Q|D) = \prod_{i=1}^{n} P(q_i|D)$
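To make the unigram assumption concrete, here is a minimal sketch that scores a query as the product of per-term probabilities, assuming the maximum-likelihood estimate $f(q_i, D) / |D|$ for each term. The function name and the toy document are hypothetical. It also shows why smoothing is needed: a single query term missing from the document drives the whole product to zero.

```python
from collections import Counter

def query_likelihood_mle(query_tokens, doc_tokens):
    """P(Q|D) under a unigram model with unsmoothed estimates f(q_i, D) / |D|."""
    counts = Counter(doc_tokens)
    prob = 1.0
    for q in query_tokens:
        prob *= counts[q] / len(doc_tokens)
    return prob

# Toy document (hypothetical data), already tokenized.
doc = ["query", "likelihood", "ranks", "documents", "by", "query", "probability"]
print(query_likelihood_mle(["query", "likelihood"], doc))  # (2/7) * (1/7)
print(query_likelihood_mle(["query", "smoothing"], doc))   # 0.0, "smoothing" is unseen
```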

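【Jelinek-Mercer Smoothing】

In its usual form, the smoothed estimate is:

$P(q_i|D) = (1-\lambda)\frac{f(q_i,D)}{|D|} + \lambda\frac{C_{q_i}}{|C|}$

$C_{q_i}$: the number of times $q_i$ occurs in the collection; $|C|$: the total number of word occurrences in the collection (tokens, not vocabulary size; a repeated word is counted each time); $\lambda \in [0,1]$: the interpolation weight.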
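A minimal sketch of Jelinek-Mercer scoring, assuming the common convention in which the document estimate gets weight $1-\lambda$ and the collection estimate $C_{q_i}/|C|$ gets weight $\lambda$. The function name, the default `lam=0.1`, and the toy data are assumptions for illustration. Log probabilities are summed instead of multiplying raw probabilities, to avoid numerical underflow on long queries.

```python
import math
from collections import Counter

def jm_log_likelihood(query_tokens, doc_tokens, collection_tokens, lam=0.1):
    """log P(Q|D) with Jelinek-Mercer smoothing.

    P(q|D) = (1 - lam) * f(q, D) / |D| + lam * C_q / |C|
    """
    doc_counts = Counter(doc_tokens)
    coll_counts = Counter(collection_tokens)
    log_p = 0.0
    for q in query_tokens:
        p_doc = doc_counts[q] / len(doc_tokens)
        p_coll = coll_counts[q] / len(collection_tokens)
        log_p += math.log((1 - lam) * p_doc + lam * p_coll)
    return log_p

# Toy collection (hypothetical data): |C| counts tokens, not distinct words.
docs = [["cat", "sat", "cat"], ["dog", "sat"]]
collection = [tok for d in docs for tok in d]

for d in docs:
    print(jm_log_likelihood(["dog"], d, collection, lam=0.5), d)
```

A query term that never occurs anywhere in the collection still has probability zero here, so `math.log` raises `ValueError` in that case; handling out-of-vocabulary query terms is left out of this sketch.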

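【Dirichlet Smoothing】

In its usual form, the smoothed estimate is:

$P(q_i|D) = \frac{f(q_i,D) + \mu \frac{C_{q_i}}{|C|}}{|D| + \mu}$

where $\mu > 0$ is the smoothing parameter, and $C_{q_i}$ and $|C|$ are the collection counts defined for Jelinek-Mercer smoothing.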
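A minimal sketch of Dirichlet-smoothed scoring, assuming the common estimate $(f(q_i,D) + \mu \cdot C_{q_i}/|C|) / (|D| + \mu)$. The function name, the default `mu=2000.0`, and the toy data are assumptions for illustration. Unlike a fixed interpolation weight, the amount of smoothing here depends on document length: short documents lean more on the collection statistics.

```python
import math
from collections import Counter

def dirichlet_log_likelihood(query_tokens, doc_tokens, collection_tokens, mu=2000.0):
    """log P(Q|D) with Dirichlet smoothing.

    P(q|D) = (f(q, D) + mu * C_q / |C|) / (|D| + mu)
    """
    doc_counts = Counter(doc_tokens)
    coll_counts = Counter(collection_tokens)
    log_p = 0.0
    for q in query_tokens:
        p_coll = coll_counts[q] / len(collection_tokens)
        p = (doc_counts[q] + mu * p_coll) / (len(doc_tokens) + mu)
        log_p += math.log(p)
    return log_p

# Toy collection (hypothetical data): |C| counts tokens, not distinct words.
docs = [["cat", "sat", "cat"], ["dog", "sat"]]
collection = [tok for d in docs for tok in d]

for d in docs:
    print(dirichlet_log_likelihood(["dog"], d, collection, mu=2.0), d)
```

The tiny `mu=2.0` in the usage lines is only there so the effect is visible on a five-token collection; as with the Jelinek-Mercer case, a query term absent from the whole collection would make `math.log` raise `ValueError`.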

4 K-L Divergence

KL divergence measures how much two probability distributions differ.
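As an illustration, the sketch below computes the discrete form $\mathrm{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ for two distributions stored as dictionaries. The function name and the example distributions are hypothetical, and the code assumes $q(x) > 0$ wherever $p(x) > 0$ (otherwise the divergence is infinite and this sketch simply fails).

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as {outcome: probability} dicts."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # outcomes with p(x) = 0 contribute nothing by convention
        total += px * math.log(px / q[x])
    return total

# Two toy distributions over the same two outcomes (hypothetical data).
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}

print(kl_divergence(p, p))  # identical distributions give 0.0
print(kl_divergence(p, q))
print(kl_divergence(q, p))  # differs from the line above: KL is not symmetric
```

Because the measure is asymmetric, which distribution plays the role of $p$ matters; it is a divergence, not a distance metric.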
