信息检索导论第十章笔记(英文)

XML Retrieval

Abstract

There are fundamental differences between IR and database systems in terms of retrieval model, data structures, and query language.

在这里插入图片描述

In the chapter, we look at how ranked retrieval methods can be adapted to structured documents to address these problems.

  • Problems
  1. Unranked retrieval methods suffer from low recall.
  2. Most users are also notoriously bad at precisely stating strctural constraints.
  3. Users are mostly unfamiliar with structured search and advanced search interfaces or unwilling touse them.

This is why we try to recommend the XML Retrieval.

  • where to use the XML Retrieval
    digital libraries
    patent databases
    blogs
    text in which entities like persons and locations have been tagged

just like the examples in the raw book:
在这里插入图片描述

but the RDB search is not suitable to the above three situation. There are three main problems which is mentioned in the abstract.

In order to solve that, we adapt the ranked retrieval methods.

Basic concepts

This is a XML document:

在这里插入图片描述

An xml document is an ordered, labeled tree. Each node of the tree is an XML element and is written with an opening and closing tag.

And one element can have one or more XML attributes.
like:

在这里插入图片描述

This is a structured tree figure of an XML object:

在这里插入图片描述

  • DOM

the XML DOM reffers to the XML document object model, like the above figure.

  • XPath

XPath is a standard for enumerating paths in an XML documents collection. Only a small subset of XPath is needed for our purposes.

For example,
title#“Macbeth” selects all titles containing
the term Macbeth.

  • SCHEMA

A schema for Shakespeare’s plays may stipulate that scenes can only occur as children of acts and that only acts and scenes have the number attribute.

  • NEXI

Narrowed Extended XPath I.

In particular, //section is embedded under //article.

在这里插入图片描述

The two yr conditions are relational attribute constraints.

We can representb querues as trees in the same way.

In this figure:

在这里插入图片描述

q1 is a search for books whose titles score highly for the keywords Julius Caesar. q2 is a search for books whose author elements score highly for Julius Caesar and whose title elements score highly for Gallic war.

Challenges in XML retrieval

Challenge 1

If we are looking for Macbeth’s castle, the search should return the scene, act or the whole play.

Users hope get a part of the document(the XML element), not the entire document like unstructured retrieval.

  • Structured doocument retrieval principle

Choose the best matched part of the document.
The system should always return the most explict and suitable answers.

According to this principle, it motivateds a retrieval strategy thatreturns the smallest unit that contains the information sought.

Challenge 2

how to define the indexing unit.

In structured retrieval, there are a series of different methods.

  1. group nodes into nonvverlapping pseudodocuments.

在这里插入图片描述

pseudodocuments may not make sence to the users because they are not coherent units

  1. use one of the largest elements as the indexing unit

this two-stage retrieval process fails to return the best subelement for many qyeries because the relevance of a whole book is often not a good predictor of the relevance of small subelements within it

  1. search all leaves, select the most relevant ones, and extend them to larger units in postprocessing(bottom up)

However, it has a similar problem as the last one: the relevance of a leaf element is often not a good predictor of the relevance of elements in which it is contained.

  1. index all elements

Indexing all elements means that search results will be highly redundant.

Because of the redundancy caused by nested elements, it is common to restrict the set of elements that are eligible to be returned. Restriction strategies
include:

在这里插入图片描述

Challenge 3

a challenge in XML retrieval related to nesting is that we may need to distinguish different contexts of a term when we compute term statistics for ranking.

In order to solve that, we compute a corresponding idf for each XML context respectively.

A vector space model for XML retrieval

For each dimension of vector space, the word and its position in the XML tree are considered simultaneously

How to do that: Mapping XML documents to lexicalized subtrees.

在这里插入图片描述

Therefore, queries and documents can be represented as vectors in the lexicalized readme space, and the similarity can be calculated according to the previous vector similarity formula

  • the difference between structured and unstructured method

The main difference between the two is that in the former, each dimension of the vector corresponds to a term, while in the latter, it corresponds to a lexicalized subtree.

There is a tradeoff between the dimension of vector space and the precision of query,
Index all paths that end up with a single term, in other words, all XML contexts.

A simple measure of the similarity of a path cq in a query and a path cd in a document is the following
context context resemblance function Cr:

在这里插入图片描述

For example,

在这里插入图片描述

The final score for a document is computed as a variant of the cosine measure:

在这里插入图片描述

Note:
SimNoMerge is not a real cosine function, because it may exceed 1.0.

  • the algorithm

在这里插入图片描述

Evaluation of XML retrieval

  • INEX

The first evaluation platform in XML Retrieval Research.

INEX 2002 collections statistics

在这里插入图片描述

INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance.

component coverage

The component coverage dimension evaluates whether the element retrieved is “structurally” correct, that is, neither too low nor too high in the tree.

Which has four cases:
Exact coverage(E)
Too small(S)
Too large(L)
No coverage(N)

Topic relevance

Topic relevance dimension also has four levels:

highly relevant(3)
fairly relevant(2)
marginally relevant(1)
nonrelevant(0)


the relevance-coverage combinations are quantized as folls:

在这里插入图片描述


The number of relevant components in a retrieved set A of components can then be computed as:

在这里插入图片描述

Note:
As an approximation, the standard definitions of precision, recall and F from Chapter 8 can be applied to this modified definition of relevant items retrieved, with some subtleties because we sum graded as opposed to binary relevance assessments.

  • Disadvantages

The overlap is not accounted for.
This problem is worse in XML retrieval because of the problem of multiple nested elements occurring in a search result.

Text-centric versus data-centric XML retrieval

Text-centric

we give higher priority to text. We do this by adapting unstructured retrieval methods to handling additional structural constraints. The premise of our approach is that XML document retrieval is characterized by

(i) long text fields (e.g., sections of a document) (ii) inexact matching
(iii) relevanceranked results. Relational databases do not deal well with this use case.

Data-centric

Data-centric in contrast, data-centric XML mainly encodes numerical and nontext XML attribute-value data. When querying data-centric XML, we want to impose
exact match conditions in most cases.
This puts the emphasis on the structural aspects of XML documents and queries.

  • 1
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值