Notes on Chapter 12 of Introduction to Information Retrieval (in English)

Language models for information retrieval

Abstract

This chapter first introduces the concept of language models and then describes the basic and most commonly used language modeling approach to IR, the query likelihood model. It then compares the language modeling approach with other approaches to IR, and finally briefly describes various extensions to the language modeling approach.

Language Model

The simplest language model is equivalent to a probabilistic finite automaton.

[Figure: A simple finite automaton and some of the strings in the language it generates. An arrow shows the start state of the automaton, and a double circle indicates a (possible) finishing state.]

[Figure: A one-state finite automaton that acts as a unigram language model, with a partial specification of the state emission probabilities.]

Types of language models

  • unigram language model

$$P_{uni}(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2)\,P(t_3)\,P(t_4)$$

For example:

$$P(\text{frog said that toad likes frog}) = P(\text{frog})\,P(\text{said})\,P(\text{that})\,P(\text{toad})\,P(\text{likes})\,P(\text{frog})$$

  • bigram language models

$$P_{bi}(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2|t_1)\,P(t_3|t_2)\,P(t_4|t_3)$$

which condition on the previous term. For example:

$$P(\text{frog said that toad}) = P(\text{frog})\,P(\text{said}|\text{frog})\,P(\text{that}|\text{said})\,P(\text{toad}|\text{that})$$

Such models are vital for tasks like speech recognition, spelling correction, and machine translation, where you need the probability of a term conditioned on surrounding context. However, most language-modeling work in IR has used unigram language models.
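As a concrete illustration (my own sketch, not from the chapter), the following Python fragment estimates MLE unigram and bigram models from a toy token stream; all function names are illustrative:

```python
from collections import Counter

def train_unigram(tokens):
    """MLE unigram model: P(t) = count(t) / total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}

def train_bigram(tokens):
    """MLE bigram model: P(t | prev) = count(prev, t) / count(prev as a predecessor)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])
    return {pair: c / prev_counts[pair[0]] for pair, c in pair_counts.items()}

def unigram_sequence_prob(model, sequence):
    """P_uni(t1 t2 ... tn) = P(t1) * P(t2) * ... * P(tn)."""
    p = 1.0
    for t in sequence:
        p *= model.get(t, 0.0)  # unseen terms get probability 0 without smoothing
    return p

tokens = "frog said that toad likes frog".split()
uni = train_unigram(tokens)
print(unigram_sequence_prob(uni, ["frog", "likes", "toad"]))
```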

Multinomial distributions over words

The equations presented above do not give the multinomial probability of a bag of words, because they do not sum over all possible orderings of those words, as is done by the multinomial coefficient (the first term on the right-hand side) in the standard presentation of a multinomial model:

$$P(d) = \frac{L_d!}{tf_{t_1,d}!\,tf_{t_2,d}!\,\cdots\,tf_{t_M,d}!}\;P(t_1)^{tf_{t_1,d}}\,P(t_2)^{tf_{t_2,d}}\cdots P(t_M)^{tf_{t_M,d}}$$

where $L_d = \sum_{1 \le i \le M} tf_{t_i,d}$ is the length of document $d$ and $M$ is the size of the term vocabulary.
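As a small worked instance of this formula (my own example, not from the chapter), take the bag of words $d$ = {frog, frog, likes}, so $L_d = 3$, $tf_{frog,d} = 2$, and $tf_{likes,d} = 1$:

$$P(d) = \frac{3!}{2!\,1!}\,P(\text{frog})^2\,P(\text{likes}) = 3\,P(\text{frog})^2\,P(\text{likes})$$

The multinomial coefficient 3 counts the three orderings in which this bag of words can be generated.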

The fundamental problem in designing language models is that we do not know what exactly we should use as the model $M_d$.

We pretend that the document $d$ is only a representative sample of text drawn from a model distribution, treating it like a fine-grained topic. We then estimate a language model from this sample, use that model to calculate the probability of observing any word sequence, and, finally, rank documents according to their probability of generating the query.

The query likelihood model

We construct the corresponding language model $M_d$ for each document $d$ in the collection. Our goal is to rank the documents by $P(d|q)$, the probability of the document given the query.

Using Bayes' rule:

$$P(d|q) = \frac{P(q|d)\,P(d)}{P(q)}$$

Here $P(q)$ is the same for all documents and can be ignored; if the document prior $P(d)$ is treated as uniform, ranking by $P(d|q)$ is equivalent to ranking by the query likelihood $P(q|d)$.

The most common way to do this is to use the multinomial unigram language model, which is equivalent to a multinomial naive Bayes model,

so we have:

$$P(q|M_d) = K_q \prod_{t \in V} P(t|M_d)^{tf_{t,q}}$$

$K_q = L_q!\,/\,(tf_{t_1,q}!\,tf_{t_2,q}!\,\cdots\,tf_{t_M,q}!)$ is the multinomial coefficient for the query $q$; since it is a constant for a particular query, it can be ignored for ranking.

  • For retrieval based on a language model (henceforth LM), we treat the generation of queries as a random process. The approach is to:
    1. infer a LM for each document;
    2. estimate $P(q|M_{d_i})$, the probability of generating the query according to each of these document models;
    3. rank the documents according to these probabilities.

Estimating the query generation probability

How do we estimate $P(q|M_d)$?

The probability of producing the query given the LM $M_d$ of document $d$, using maximum likelihood estimation (MLE) and the unigram assumption, is:

$$\hat{P}(q|M_d) = \prod_{t \in q} \hat{P}_{mle}(t|M_d) = \prod_{t \in q} \frac{tf_{t,d}}{L_d}$$

where $L_d$ is the number of tokens in document $d$.

  • Approaches to smoothing probability distributions:
  1. A term that does not occur in a document should still be possible in a query, but no more likely than would be expected by chance in the collection: if $tf_{t,d} = 0$, then

$$\hat{P}(t|M_d) \le \frac{cf_t}{T}$$

where $cf_t$ is the raw count of the term in the collection, and $T$ is the raw size (number of tokens) of the entire collection.

  2. A simple idea that works well in practice is to use a mixture between a document-specific multinomial distribution and a multinomial distribution estimated from the entire collection:

$$\hat{P}(t|d) = \lambda \hat{P}_{mle}(t|M_d) + (1-\lambda)\,\hat{P}_{mle}(t|M_c)$$

where $0 < \lambda < 1$ and $M_c$ is a language model built from the entire document collection; a code sketch of this and the following smoothing scheme appears after this list.

  3. Use an LM built from the whole collection as a prior distribution in a Bayesian updating process:

$$\hat{P}(t|d) = \frac{tf_{t,d} + \alpha \hat{P}(t|M_c)}{L_d + \alpha}$$
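A minimal Python sketch of the query likelihood model with both smoothing schemes (my own illustration; the defaults $\lambda = 0.5$ and $\alpha = 2000$ are common choices in the literature, not values prescribed by this chapter):

```python
import math
from collections import Counter

def jelinek_mercer(tf_td, doc_len, p_coll, lam=0.5):
    """Mixture smoothing: lambda * P_mle(t|M_d) + (1 - lambda) * P_mle(t|M_c)."""
    p_doc = tf_td / doc_len if doc_len > 0 else 0.0
    return lam * p_doc + (1 - lam) * p_coll

def dirichlet(tf_td, doc_len, p_coll, alpha=2000):
    """Bayesian smoothing with the collection model as prior:
    (tf_{t,d} + alpha * P(t|M_c)) / (L_d + alpha)."""
    return (tf_td + alpha * p_coll) / (doc_len + alpha)

def query_log_likelihood(query_tokens, doc_tokens, coll_counts, coll_len,
                         smooth=jelinek_mercer):
    """log P(q|M_d) under a smoothed unigram model."""
    tf = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for t in query_tokens:
        p_coll = coll_counts.get(t, 0) / coll_len   # cf_t / T
        p = smooth(tf.get(t, 0), doc_len, p_coll)
        if p == 0.0:            # query term absent from the entire collection
            return float("-inf")
        score += math.log(p)
    return score

# Toy usage: rank two "documents" against a query.
docs = {"d1": "frog likes toad".split(), "d2": "toad said that toad".split()}
all_tokens = [t for d in docs.values() for t in d]
coll_counts, coll_len = Counter(all_tokens), len(all_tokens)
query = "frog toad".split()
ranking = sorted(docs, reverse=True,
                 key=lambda d: query_log_likelihood(query, docs[d],
                                                    coll_counts, coll_len))
print(ranking)  # documents ordered by descending query log likelihood
```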

Language modeling versus other approaches in information retrieval

The LM approach provides a novel way of looking at the problem of text retrieval, which links it with a lot of recent work in speech and language processing.

The LM approach assumes that documents and expressions of information needs are objects of the same type, and assesses their match by importing the tools and methods of language modeling from speech and natural language processing.

The resulting model is mathematically precise, conceptually simple, computationally tractable, and intuitively appealing.

Extended language modeling approaches

  • You can look at the probability of a query language model $M_q$ generating the document.

This approach is less appealing because there is much less text available to estimate a language model based on the query text, but it is easy to see how to incorporate relevance feedback into such a model.

  • Rather than directly generating in either direction, we can make an LM from both the document and the query, and then ask how different these two language models are from each other.

[Figure: Three ways of developing the language modeling approach: (a) query likelihood; (b) document likelihood; (c) model comparison.]

For instance, one way to model the risk of returning a document $d$ as relevant to a query $q$ is to use the Kullback-Leibler (KL) divergence between their respective language models:

$$R(d;q) = KL(M_d \,\|\, M_q) = \sum_{t \in V} P(t|M_q)\,\log\frac{P(t|M_q)}{P(t|M_d)}$$
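A minimal sketch of this scoring rule (my own illustration), assuming each model is a plain term-to-probability dict and the document model has been smoothed so that every term with nonzero query-model probability also has nonzero document-model probability:

```python
import math

def risk(p_query, p_doc):
    """Risk score from the chapter's formula:
    R(d; q) = sum over t of P(t|M_q) * log(P(t|M_q) / P(t|M_d)).
    Documents whose model diverges less from the query model rank higher."""
    total = 0.0
    for t, pq in p_query.items():
        if pq > 0.0:
            total += pq * math.log(pq / p_doc[t])  # requires p_doc[t] > 0 (smoothed)
    return total

# Toy usage: the document model closer to the query model gets the smaller risk.
p_q = {"frog": 0.5, "toad": 0.5}
p_d1 = {"frog": 0.4, "toad": 0.4, "said": 0.2}
p_d2 = {"frog": 0.1, "toad": 0.2, "said": 0.7}
print(risk(p_q, p_d1) < risk(p_q, p_d2))  # True: d1 is the better match
```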

  • Introduce translation models to bridge the query-document gap and address issues of alternate expression (e.g., synonymy).

Assume that the translation model can be represented by a conditional probability distribution $T(\cdot|\cdot)$ between vocabulary terms. The form of the translation query generation model is then:

$$P(q|M_d) = \prod_{t \in q} \sum_{v \in V} P(v|M_d)\,T(t|v)$$
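A minimal sketch of this translation-augmented scoring (my own illustration); representing $T$ as a dict keyed by (query term, vocabulary term) pairs is an assumption made here for compactness:

```python
import math

def translation_log_likelihood(query_tokens, p_doc, T):
    """log P(q|M_d) = sum over query terms t of
    log( sum over vocabulary terms v of P(v|M_d) * T(t|v) ).

    p_doc: dict v -> P(v|M_d), a (smoothed) document language model
    T:     dict (t, v) -> T(t|v), probability that v "translates" to t
    """
    log_p = 0.0
    for t in query_tokens:
        p_t = sum(p_v * T.get((t, v), 0.0) for v, p_v in p_doc.items())
        if p_t == 0.0:      # no vocabulary term can generate this query term
            return float("-inf")
        log_p += math.log(p_t)
    return log_p

# Toy usage: "amphibian" in the query can be generated by "frog" or "toad"
# occurring in the document, even though it never appears there itself.
p_doc = {"frog": 0.6, "toad": 0.4}
T = {("amphibian", "frog"): 0.3, ("amphibian", "toad"): 0.3,
     ("frog", "frog"): 0.7, ("toad", "toad"): 0.7}
print(translation_log_likelihood(["amphibian"], p_doc, T))
```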
