信息检索导论第十章笔记(英文)

最新推荐文章于 2024-01-05 08:45:00 发布

Braylon1002

最新推荐文章于 2024-01-05 08:45:00 发布

阅读量1.1k

点赞数 1

分类专栏：数据挖掘文章标签：信息检索导论

本文链接：https://blog.csdn.net/qq_40742298/article/details/108319311

版权

数据挖掘专栏收录该内容

53 篇文章 11 订阅

订阅专栏

XML Retrieval

Abstract

There are fundamental differences between IR and database systems in terms of retrieval model, data structures, and query language.

在这里插入图片描述

In the chapter, we look at how ranked retrieval methods can be adapted to structured documents to address these problems.

Problems

Unranked retrieval methods suffer from low recall.
Most users are also notoriously bad at precisely stating strctural constraints.
Users are mostly unfamiliar with structured search and advanced search interfaces or unwilling touse them.

This is why we try to recommend the XML Retrieval.

where to use the XML Retrieval
digital libraries
patent databases
blogs
text in which entities like persons and locations have been tagged

just like the examples in the raw book:

but the RDB search is not suitable to the above three situation. There are three main problems which is mentioned in the abstract.

In order to solve that, we adapt the ranked retrieval methods.

Basic concepts

This is a XML document:

在这里插入图片描述

An xml document is an ordered, labeled tree. Each node of the tree is an XML element and is written with an opening and closing tag.

And one element can have one or more XML attributes.
like:

在这里插入图片描述

This is a structured tree figure of an XML object:

在这里插入图片描述

the XML DOM reffers to the XML document object model, like the above figure.

XPath

XPath is a standard for enumerating paths in an XML documents collection. Only a small subset of XPath is needed for our purposes.

For example,
title#“Macbeth” selects all titles containing
the term Macbeth.

SCHEMA

A schema for Shakespeare’s plays may stipulate that scenes can only occur as children of acts and that only acts and scenes have the number attribute.

NEXI

Narrowed Extended XPath I.

In particular, //section is embedded under //article.

在这里插入图片描述

The two yr conditions are relational attribute constraints.

We can representb querues as trees in the same way.

In this figure:

在这里插入图片描述

q1 is a search for books whose titles score highly for the keywords Julius Caesar. q2 is a search for books whose author elements score highly for Julius Caesar and whose title elements score highly for Gallic war.

Challenges in XML retrieval

Challenge 1

If we are looking for Macbeth’s castle, the search should return the scene, act or the whole play.

Users hope get a part of the document(the XML element), not the entire document like unstructured retrieval.

Structured doocument retrieval principle

Choose the best matched part of the document.
The system should always return the most explict and suitable answers.

According to this principle, it motivateds a retrieval strategy thatreturns the smallest unit that contains the information sought.

Challenge 2

how to define the indexing unit.

In structured retrieval, there are a series of different methods.

group nodes into nonvverlapping pseudodocuments.

在这里插入图片描述

pseudodocuments may not make sence to the users because they are not coherent units

use one of the largest elements as the indexing unit

this two-stage retrieval process fails to return the best subelement for many qyeries because the relevance of a whole book is often not a good predictor of the relevance of small subelements within it

search all leaves, select the most relevant ones, and extend them to larger units in postprocessing(bottom up)

However, it has a similar problem as the last one: the relevance of a leaf element is often not a good predictor of the relevance of elements in which it is contained.

index all elements

Indexing all elements means that search results will be highly redundant.

Because of the redundancy caused by nested elements, it is common to restrict the set of elements that are eligible to be returned. Restriction strategies
include:

在这里插入图片描述

Challenge 3

a challenge in XML retrieval related to nesting is that we may need to distinguish different contexts of a term when we compute term statistics for ranking.

In order to solve that, we compute a corresponding idf for each XML context respectively.

A vector space model for XML retrieval

For each dimension of vector space, the word and its position in the XML tree are considered simultaneously

How to do that: Mapping XML documents to lexicalized subtrees.

在这里插入图片描述

Therefore, queries and documents can be represented as vectors in the lexicalized readme space, and the similarity can be calculated according to the previous vector similarity formula

the difference between structured and unstructured method

The main difference between the two is that in the former, each dimension of the vector corresponds to a term, while in the latter, it corresponds to a lexicalized subtree.

There is a tradeoff between the dimension of vector space and the precision of query,
Index all paths that end up with a single term, in other words, all XML contexts.

A simple measure of the similarity of a path cq in a query and a path cd in a document is the following
context context resemblance function Cr:

在这里插入图片描述

For example,

在这里插入图片描述

The final score for a document is computed as a variant of the cosine measure:

在这里插入图片描述

Note:
SimNoMerge is not a real cosine function, because it may exceed 1.0.

the algorithm

在这里插入图片描述

Evaluation of XML retrieval

INEX

The first evaluation platform in XML Retrieval Research.

INEX 2002 collections statistics

在这里插入图片描述

INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance.

component coverage

The component coverage dimension evaluates whether the element retrieved is “structurally” correct, that is, neither too low nor too high in the tree.

Which has four cases:
Exact coverage(E)
Too small(S)
Too large(L)
No coverage(N)

Topic relevance

Topic relevance dimension also has four levels:

highly relevant(3)
fairly relevant(2)
marginally relevant(1)
nonrelevant(0)

the relevance-coverage combinations are quantized as folls:

在这里插入图片描述

The number of relevant components in a retrieved set A of components can then be computed as:

在这里插入图片描述

Note:
As an approximation, the standard definitions of precision, recall and F from Chapter 8 can be applied to this modified definition of relevant items retrieved, with some subtleties because we sum graded as opposed to binary relevance assessments.

Disadvantages

The overlap is not accounted for.
This problem is worse in XML retrieval because of the problem of multiple nested elements occurring in a search result.

Text-centric versus data-centric XML retrieval

Text-centric

we give higher priority to text. We do this by adapting unstructured retrieval methods to handling additional structural constraints. The premise of our approach is that XML document retrieval is characterized by

(i) long text fields (e.g., sections of a document) (ii) inexact matching
(iii) relevanceranked results. Relational databases do not deal well with this use case.

Data-centric

Data-centric in contrast, data-centric XML mainly encodes numerical and nontext XML attribute-value data. When querying data-centric XML, we want to impose
exact match conditions in most cases.
This puts the emphasis on the structural aspects of XML documents and queries.

Braylon1002

关注

1
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
信息检索导论第十章笔记(英文)

XML RetrievalAbstractThere are fundamental differences between IR and database systems in terms of retrieval model, data structures, and query language.In the chapter, we look at how ranked retrieval methods can be adapted to structured documents to ad
复制链接

扫一扫

专栏目录