Notes on Chapter 12 of Introduction to Information Retrieval (in English)

Language models for information retrieval

Abstract

This chapter first introduces the concept of language models and then describes the basic and most commonly used language modeling approach to IR, the query likelihood model. It then compares the language modeling approach with other approaches to IR, and finally briefly describes various extensions to the language modeling approach.

Language Model

The simplest language model is equivalent to a probabilistic finite automaton.

[Figure: A simple finite automaton and some of the strings in the language it generates. An arrow shows the start state of the automaton, and a double circle indicates a (possible) finishing state.]

[Figure: A one-state finite automaton that acts as a unigram language model, with a partial specification of the state emission probabilities.]

Types of language models

  • unigram language model

$$P_{uni}(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2)\,P(t_3)\,P(t_4)$$

For example:

$$P(\text{frog said that toad likes frog}) = P(\text{frog})\,P(\text{said})\,P(\text{that})\,P(\text{toad})\,P(\text{likes})\,P(\text{frog})$$

  • bigram language models

$$P_{bi}(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2|t_1)\,P(t_3|t_2)\,P(t_4|t_3)$$

which condition on the previous term. For example:

$$P(\text{frog said that toad}) = P(\text{frog})\,P(\text{said}|\text{frog})\,P(\text{that}|\text{said})\,P(\text{toad}|\text{that})$$

Such models are vital for tasks like speech recognition, spelling correction, and machine translation, where you need the probability of a term conditioned on surrounding context. However, most language-modeling work in IR has used unigram language models.
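As a concrete illustration (my own sketch, not from the chapter), the following Python fragment estimates MLE unigram and bigram models from a toy token stream; all function names are illustrative:

```python
from collections import Counter

def train_unigram(tokens):
    """MLE unigram model: P(t) = count(t) / total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}

def train_bigram(tokens):
    """MLE bigram model: P(t | prev) = count(prev, t) / count(prev as a predecessor)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])
    return {pair: c / prev_counts[pair[0]] for pair, c in pair_counts.items()}

def unigram_sequence_prob(model, sequence):
    """P_uni(t1 t2 ... tn) = P(t1) * P(t2) * ... * P(tn)."""
    p = 1.0
    for t in sequence:
        p *= model.get(t, 0.0)  # unseen terms get probability 0 without smoothing
    return p

tokens = "frog said that toad likes frog".split()
uni = train_unigram(tokens)
print(unigram_sequence_prob(uni, ["frog", "likes", "toad"]))
```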

Multinomial distributions over words

The equations presented above do not give the multinomial probability of a bag of words, because they do not sum over all possible orderings of those words, as is done by the multinomial coefficient (the first term on the right-hand side) in the standard presentation of a multinomial model:

$$P(d) = \frac{L_d!}{tf_{t_1,d}!\,tf_{t_2,d}!\,\cdots\,tf_{t_M,d}!}\;P(t_1)^{tf_{t_1,d}}\,P(t_2)^{tf_{t_2,d}}\cdots P(t_M)^{tf_{t_M,d}}$$

where $L_d = \sum_{1 \le i \le M} tf_{t_i,d}$ is the length of document $d$ and $M$ is the size of the term vocabulary.
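As a small worked instance of this formula (my own example, not from the chapter), take the bag of words $d$ = {frog, frog, likes}, so $L_d = 3$, $tf_{frog,d} = 2$, and $tf_{likes,d} = 1$:

$$P(d) = \frac{3!}{2!\,1!}\,P(\text{frog})^2\,P(\text{likes}) = 3\,P(\text{frog})^2\,P(\text{likes})$$

The multinomial coefficient 3 counts the three orderings in which this bag of words can be generated.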

The fundamental problem in designing language models is that we do not know what exactly we should use as the model $M_d$.

We pretend that the document $d$ is only a representative sample of text drawn from a model distribution, treating it like a fine-grained topic. We then estimate a language model from this sample, use that model to calculate the probability of observing any word sequence, and, finally, rank documents according to their probability of generating the query.

The query likelihood model

We construct the corresponding language model $M_d$ for each document $d$ in the collection. Our goal is to rank the documents by $P(d|q)$, the probability of the document given the query.

Using Bayes' rule:

$$P(d|q) = \frac{P(q|d)\,P(d)}{P(q)}$$

Here $P(q)$ is the same for all documents and can be ignored; if the document prior $P(d)$ is treated as uniform, ranking by $P(d|q)$ is equivalent to ranking by the query likelihood $P(q|d)$.

The most common way to do this is to use the multinomial unigram language model, which is equivalent to a multinomial naive Bayes model,

so we have:

$$P(q|M_d) = K_q \prod_{t \in V} P(t|M_d)^{tf_{t,q}}$$

$K_q = L_q!\,/\,(tf_{t_1,q}!\,tf_{t_2,q}!\,\cdots\,tf_{t_M,q}!)$ is the multinomial coefficient for the query $q$; since it is a constant for a particular query, it can be ignored for ranking.

  • For retrieval based on a language model (henceforth LM), we treat the generation of queries as a random process. The approach is to:
    1. infer a LM for each document;
    2. estimate $P(q|M_{d_i})$, the probability of generating the query according to each of these document models;
    3. rank the documents according to these probabilities.

Estimating the query generation probability

How do we estimate $P(q|M_d)$?

The probability of producing the query given the LM $M_d$ of document $d$, using maximum likelihood estimation (MLE) and the unigram assumption, is:

$$\hat{P}(q|M_d) = \prod_{t \in q} \hat{P}_{mle}(t|M_d) = \prod_{t \in q} \frac{tf_{t,d}}{L_d}$$

where $L_d$ is the number of tokens in document $d$.

  • Approaches to smoothing probability distributions:
  1. A term that does not occur in a document should still be possible in a query, but no more likely than would be expected by chance in the collection: if $tf_{t,d} = 0$, then

$$\hat{P}(t|M_d) \le \frac{cf_t}{T}$$

where $cf_t$ is the raw count of the term in the collection, and $T$ is the raw size (number of tokens) of the entire collection.

  2. A simple idea that works well in practice is to use a mixture between a document-specific multinomial distribution and a multinomial distribution estimated from the entire collection:

$$\hat{P}(t|d) = \lambda \hat{P}_{mle}(t|M_d) + (1-\lambda)\,\hat{P}_{mle}(t|M_c)$$

where $0 < \lambda < 1$ and $M_c$ is a language model built from the entire document collection; a code sketch of this and the following smoothing scheme appears after this list.

  3. Use an LM built from the whole collection as a prior distribution in a Bayesian updating process:

$$\hat{P}(t|d) = \frac{tf_{t,d} + \alpha \hat{P}(t|M_c)}{L_d + \alpha}$$
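A minimal Python sketch of the query likelihood model with both smoothing schemes (my own illustration; the defaults $\lambda = 0.5$ and $\alpha = 2000$ are common choices in the literature, not values prescribed by this chapter):

```python
import math
from collections import Counter

def jelinek_mercer(tf_td, doc_len, p_coll, lam=0.5):
    """Mixture smoothing: lambda * P_mle(t|M_d) + (1 - lambda) * P_mle(t|M_c)."""
    p_doc = tf_td / doc_len if doc_len > 0 else 0.0
    return lam * p_doc + (1 - lam) * p_coll

def dirichlet(tf_td, doc_len, p_coll, alpha=2000):
    """Bayesian smoothing with the collection model as prior:
    (tf_{t,d} + alpha * P(t|M_c)) / (L_d + alpha)."""
    return (tf_td + alpha * p_coll) / (doc_len + alpha)

def query_log_likelihood(query_tokens, doc_tokens, coll_counts, coll_len,
                         smooth=jelinek_mercer):
    """log P(q|M_d) under a smoothed unigram model."""
    tf = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for t in query_tokens:
        p_coll = coll_counts.get(t, 0) / coll_len   # cf_t / T
        p = smooth(tf.get(t, 0), doc_len, p_coll)
        if p == 0.0:            # query term absent from the entire collection
            return float("-inf")
        score += math.log(p)
    return score

# Toy usage: rank two "documents" against a query.
docs = {"d1": "frog likes toad".split(), "d2": "toad said that toad".split()}
all_tokens = [t for d in docs.values() for t in d]
coll_counts, coll_len = Counter(all_tokens), len(all_tokens)
query = "frog toad".split()
ranking = sorted(docs, reverse=True,
                 key=lambda d: query_log_likelihood(query, docs[d],
                                                    coll_counts, coll_len))
print(ranking)  # documents ordered by descending query log likelihood
```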

Language modeling versus other approaches in information retrieval

The LM approach provides a novel way of looking at the problem of text retrieval, which links it with a lot of recent work in speech and language processing.

The LM approach assumes that documents and expressions of information needs are objects of the same type, and assesses their match by importing the tools and methods of language modeling from speech and natural language processing.

The resulting model is mathematically precise, conceptually simple, computationally tractable, and intuitively appealing.

Extended language modeling approaches

  • You can look at the probability of a query language model $M_q$ generating the document.

This approach is less appealing because there is much less text available to estimate a language model based on the query text, but it is easy to see how to incorporate relevance feedback into such a model.

  • Rather than directly generating in either direction, we can make an LM from both the document and the query, and then ask how different these two language models are from each other.

[Figure: Three ways of developing the language modeling approach: (a) query likelihood; (b) document likelihood; (c) model comparison.]

For instance, one way to model the risk of returning a document $d$ as relevant to a query $q$ is to use the Kullback-Leibler (KL) divergence between their respective language models:

$$R(d;q) = KL(M_d \,\|\, M_q) = \sum_{t \in V} P(t|M_q)\,\log\frac{P(t|M_q)}{P(t|M_d)}$$
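A minimal sketch of this scoring rule (my own illustration), assuming each model is a plain term-to-probability dict and the document model has been smoothed so that every term with nonzero query-model probability also has nonzero document-model probability:

```python
import math

def risk(p_query, p_doc):
    """Risk score from the chapter's formula:
    R(d; q) = sum over t of P(t|M_q) * log(P(t|M_q) / P(t|M_d)).
    Documents whose model diverges less from the query model rank higher."""
    total = 0.0
    for t, pq in p_query.items():
        if pq > 0.0:
            total += pq * math.log(pq / p_doc[t])  # requires p_doc[t] > 0 (smoothed)
    return total

# Toy usage: the document model closer to the query model gets the smaller risk.
p_q = {"frog": 0.5, "toad": 0.5}
p_d1 = {"frog": 0.4, "toad": 0.4, "said": 0.2}
p_d2 = {"frog": 0.1, "toad": 0.2, "said": 0.7}
print(risk(p_q, p_d1) < risk(p_q, p_d2))  # True: d1 is the better match
```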

  • Introduce translation models to bridge the query-document gap and address issues of alternate expression (e.g., synonymy).

Assume that the translation model can be represented by a conditional probability distribution $T(\cdot|\cdot)$ between vocabulary terms. The form of the translation query generation model is then:

$$P(q|M_d) = \prod_{t \in q} \sum_{v \in V} P(v|M_d)\,T(t|v)$$
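A minimal sketch of this translation-augmented scoring (my own illustration); representing $T$ as a dict keyed by (query term, vocabulary term) pairs is an assumption made here for compactness:

```python
import math

def translation_log_likelihood(query_tokens, p_doc, T):
    """log P(q|M_d) = sum over query terms t of
    log( sum over vocabulary terms v of P(v|M_d) * T(t|v) ).

    p_doc: dict v -> P(v|M_d), a (smoothed) document language model
    T:     dict (t, v) -> T(t|v), probability that v "translates" to t
    """
    log_p = 0.0
    for t in query_tokens:
        p_t = sum(p_v * T.get((t, v), 0.0) for v, p_v in p_doc.items())
        if p_t == 0.0:      # no vocabulary term can generate this query term
            return float("-inf")
        log_p += math.log(p_t)
    return log_p

# Toy usage: "amphibian" in the query can be generated by "frog" or "toad"
# occurring in the document, even though it never appears there itself.
p_doc = {"frog": 0.6, "toad": 0.4}
T = {("amphibian", "frog"): 0.3, ("amphibian", "toad"): 0.3,
     ("frog", "frog"): 0.7, ("toad", "toad"): 0.7}
print(translation_log_likelihood(["amphibian"], p_doc, T))
```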
