[Paper Reading, CIKM 2011] Finding Dimensions for Queries

Foreword

Abstract

We address the problem of finding multiple groups of words or phrases that explain the underlying query facets, which we refer to as query dimensions. We assume that the important aspects of a query are usually presented and repeated in the query’s top retrieved documents in the style of lists, and query dimensions can be mined out by aggregating these significant lists.

We propose aggregating frequent lists within the top search results to mine query dimensions, and implement a system called QDMiner.

Method

QDMiner discovers query dimensions by aggregating frequent lists within the top results.

  • Important information is usually organized in list formats by websites
    • Listing is a graceful way to show parallel knowledge or items
  • Important lists are commonly supported by relevant websites and hence repeat in the top search results, whereas unimportant lists just infrequently appear in results.

Query dimensions are mined by the following four steps:

  • List Extraction: Several types of lists are extracted from each document
  • List Weighting: All extracted lists are weighted, so that unimportant or noisy lists can be assigned low weights.
  • List Clustering: Similar lists are grouped together to compose a dimension.
  • Dimension and Item Ranking: Dimensions (relative to one another) and their items (within a dimension) are evaluated and ranked based on their importance.

List Extraction

For each document $d$, we extract a set of lists from the HTML content of $d$ based on three different types of patterns:

  • Free text patterns:

    • pattern: item{, item}*(and|or) {other} item

      Example 1 We shop for gorgeous watches from Seiko, Bulova, Lucien Piccard, Citizen, Cartier or Invicta

    • The pattern {^item (: |-) .+$}+ is further used to extract lists from some semi-structured paragraphs

      Example 2 … are highly important for following reasons: Consistency - every fact table is filtered consistently res… Integration - queries are able to drill different processes … Reduced development time to market - the common dimensions are available without recreating the wheel over again.

  • HTML tag patterns:

    • style of HTML tags
      • SELECT: extract all text from its child tags (OPTION) to create a list
      • UL / OL: extract text within their child tags (LI)
      • TABLE: extract one list from each column or each row


  • Repeat region patterns:


    • First detect repeat regions in webpages based on vision-based DOM trees

    • Then extract all leaf HTML nodes within each block, and group them by their tag names (name, rating, etc.) and display styles.

    • Last, for each group, extract all text from its nodes as a list

    Note: we do post-processing for each extracted list
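The first two pattern types can be sketched in a few lines of Python. This is only an illustration, not the QDMiner implementation: the `ITEM` regex (an item is taken to be one to four capitalized words), the `extract_free_text_list` helper, and the `TagListParser` class are all assumptions made for the example, since the notes do not say how item boundaries are detected. TABLE handling and repeat-region detection are left out.

```python
import re
from html.parser import HTMLParser

# Free-text pattern: item{, item}* (and|or) {other} item
# Assumption: an item is a run of one to four capitalized words.
ITEM = r"[A-Z][\w'&-]*(?: [A-Z][\w'&-]*){0,3}"
FREE_TEXT = re.compile(rf"({ITEM}(?:, {ITEM})*),? (?:and|or) (?:other )?({ITEM})")


def extract_free_text_list(sentence):
    """Return the items of the first enumeration found in the sentence, or []."""
    m = FREE_TEXT.search(sentence)
    if not m:
        return []
    return m.group(1).split(", ") + [m.group(2)]


class TagListParser(HTMLParser):
    """Collect one list per SELECT/UL/OL element from its OPTION/LI children.

    Simplification: nested lists are handled only approximately.
    """

    CONTAINERS = {"select": "option", "ul": "li", "ol": "li"}

    def __init__(self):
        super().__init__()
        self.lists = []    # finished lists
        self._stack = []   # open containers as (child_tag, items)
        self._buf = None   # text pieces of the item being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            if self._stack:
                self._flush()
            self._stack.append((self.CONTAINERS[tag], []))
        elif self._stack and tag == self._stack[-1][0]:
            self._flush()          # OPTION/LI end tags are often omitted
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._stack and tag == self._stack[-1][0]:
            self._flush()
        elif tag in self.CONTAINERS and self._stack:
            self._flush()
            _, items = self._stack.pop()
            if len(items) > 1:     # a single item is not a list
                self.lists.append(items)

    def _flush(self):
        if self._buf is not None:
            text = " ".join("".join(self._buf).split())
            if text:
                self._stack[-1][1].append(text)
            self._buf = None


sentence = ("We shop for gorgeous watches from Seiko, Bulova, "
            "Lucien Piccard, Citizen, Cartier or Invicta")
print(extract_free_text_list(sentence))

parser = TagListParser()
parser.feed("<ul><li>Seiko</li><li> Citizen </li></ul>"
            "<select><option>Men<option>Women</select>")
print(parser.lists)
```

A real extractor would need a much more tolerant notion of "item" than the capitalized-word heuristic used here.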

List Weighting

[Image]

Lists of the type shown in the image above are useless for finding dimensions and should be penalized.

We propose to aggregate all lists of a query, and evaluate the importance of each unique list $l$ by the following components:

  • document matching weight: $S_{\mathrm{DOC}}=\sum_{d \in R}\left(s_d^m * s_d^r\right)$, where $R$ is the set of top retrieved documents

    • $s_d^m$ is the percentage of the items of $l$ contained in $d$
      • $s_d^m=\frac{|d \cap l|}{|l|}$
    • $s_d^r$ measures the importance of document $d$
      • $s_d^r=1 / \sqrt{\operatorname{rank}_d}$
      • The higher $d$ is ranked, the larger its score $s_d^r$ is (a higher-ranked $d$ is more relevant to the query).
  • average inverse document frequency (IDF) of items: $S_{\mathrm{IDF}}$

    • A list composed of items that are common in a corpus (we use ClueWeb09) is not informative to the query.

The importance of a list $l$: $S_l = S_{\mathrm{DOC}} * S_{\mathrm{IDF}}$
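A minimal sketch of this weighting is given below. Several details are assumptions, not taken from the paper: each document is represented as a plain set of item strings, and the IDF is the simple `log(N / df)` form with the document frequency floored at 1, because these notes only say "average IDF". The function names are hypothetical.

```python
import math


def doc_matching_weight(list_items, ranked_docs):
    """S_DOC: sum over the top documents d of s_d^m * s_d^r.

    ranked_docs holds one set of items per document, best-ranked first.
    """
    items = set(list_items)
    score = 0.0
    for rank, doc_items in enumerate(ranked_docs, start=1):
        s_m = len(items & doc_items) / len(items)  # share of l's items found in d
        s_r = 1.0 / math.sqrt(rank)                # higher-ranked documents count more
        score += s_m * s_r
    return score


def average_idf(list_items, doc_freq, corpus_size):
    """S_IDF: mean IDF of the items (assumed form: log(N / df), df floored at 1)."""
    items = set(list_items)
    total = sum(math.log(corpus_size / max(doc_freq.get(e, 0), 1)) for e in items)
    return total / len(items)


def list_weight(list_items, ranked_docs, doc_freq, corpus_size):
    """S_l = S_DOC * S_IDF."""
    return (doc_matching_weight(list_items, ranked_docs)
            * average_idf(list_items, doc_freq, corpus_size))


docs = [{"a", "b"}, {"a"}]        # two top results, rank 1 and rank 2
freq = {"a": 10, "b": 100}        # hypothetical corpus document frequencies
print(list_weight(["a", "b"], docs, freq, corpus_size=1000))
```

A list of very common words gets a small `average_idf` and therefore a small overall weight, even if it appears in many of the top documents.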

List Clustering

Two lists can be grouped together if they share enough items

  • Distance between two lists: $d_l\left(l_1, l_2\right)=1-\frac{\left|l_1 \cap l_2\right|}{\min \left\{\left|l_1\right|,\left|l_2\right|\right\}}$
  • Distance between two clusters of lists (complete linkage): $d_c\left(c_1, c_2\right)=\max _{l_1 \in c_1, l_2 \in c_2} d_l\left(l_1, l_2\right)$
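The two distances translate directly into code. In this small sketch lists are treated as sets of item strings and are assumed to be non-empty; the function names are chosen for the example.

```python
def list_distance(l1, l2):
    """d_l: 1 minus the overlap relative to the shorter list."""
    a, b = set(l1), set(l2)
    return 1.0 - len(a & b) / min(len(a), len(b))


def cluster_distance(c1, c2):
    """d_c: the largest pairwise list distance between the two clusters."""
    return max(list_distance(l1, l2) for l1 in c1 for l2 in c2)


colors_a = ["red", "blue", "green"]
colors_b = ["blue", "green", "black", "white"]
sizes = ["S", "M", "L"]
print(list_distance(colors_a, colors_b))            # two of three items shared
print(cluster_distance([colors_a], [colors_b, sizes]))
```

Dividing by the size of the shorter list means a short list that is fully contained in a longer one has distance 0 to it.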

A modified QT (Quality Threshold) clustering algorithm is used to group similar lists. The original QT algorithm assumes that all data points are equally important.

We modify the original QT algorithm so that highly weighted lists are grouped first, and refer to the result as WQT (Quality Threshold with Weighted data points).

Individual weighted lists are not used as query dimensions on their own.
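The idea of seeding clusters with the heaviest lists can be sketched as follows. This is a simplified reading of WQT, not the paper's exact procedure: the parameters `max_diameter` and `min_lists` and their default values are hypothetical, and dropping clusters with fewer than `min_lists` lists is only one interpretation of the last remark above.

```python
def list_distance(l1, l2):
    """1 minus the overlap relative to the shorter list."""
    a, b = set(l1), set(l2)
    return 1.0 - len(a & b) / min(len(a), len(b))


def wqt_cluster(lists, weights, max_diameter=0.5, min_lists=2):
    """Greedy weighted quality-threshold clustering (illustrative sketch)."""
    # Unclustered list indices, heaviest first.
    pool = sorted(range(len(lists)), key=lambda i: weights[i], reverse=True)
    clusters = []
    while pool:
        cluster = [pool.pop(0)]  # seed with the heaviest remaining list

        def farthest(i):
            # Largest distance from candidate i to any current member.
            return max(list_distance(lists[i], lists[j]) for j in cluster)

        while pool:
            best = min(pool, key=farthest)
            if farthest(best) > max_diameter:
                break            # adding any list would exceed the diameter
            pool.remove(best)
            cluster.append(best)
        if len(cluster) >= min_lists:
            clusters.append([lists[i] for i in cluster])
    return clusters


lists = [
    ["red", "blue", "green"],
    ["red", "blue", "black"],
    ["S", "M", "L"],
    ["S", "M", "XL"],
    ["about", "contact"],
]
weights = [3.0, 1.0, 2.0, 2.5, 0.1]
for cluster in wqt_cluster(lists, weights):
    print(cluster)
```

Because every cluster already satisfies the diameter limit, checking only the distances from the candidate to the current members is enough to keep the limit after the candidate is added.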

Dimension and Item Ranking

A good dimension should frequently appear in the top results. A dimension $c$ is more important if:

  • (1) the lists in $c$ are extracted from more unique websites;
  • (2) the lists in $c$ are more important, i.e., they have higher weights.

[Image: formula for the importance of dimension $c$]

  • $S_l$ is the weight of a list $l$
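The exact scoring formula is not reproduced in the text of these notes. One rule that is consistent with the two criteria above, assumed here purely for illustration, is to let each unique website contribute the weight of its best list in the dimension and to sum these contributions. The `dimension_score` helper and its input format are hypothetical.

```python
from collections import defaultdict


def dimension_score(cluster):
    """Sum, over unique websites, of the highest list weight from that site.

    cluster: iterable of (site, list_weight) pairs, one per list in the
    dimension. Weights are assumed to be non-negative.
    """
    best = defaultdict(float)
    for site, weight in cluster:
        best[site] = max(best[site], weight)
    return sum(best.values())


dimensions = {
    "brands": [("a.com", 2.0), ("a.com", 3.0), ("b.com", 1.5)],
    "gender": [("a.com", 4.0)],
}
ranked = sorted(dimensions, key=lambda name: dimension_score(dimensions[name]),
                reverse=True)
print(ranked)
```

Counting each website once keeps a single site that repeats the same list many times from dominating the ranking.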

In a dimension, the importance of an item depends on how many lists contain the item and its ranks in the lists.

[Image: formula for the importance $S_{e|c}$ of item $e$ in dimension $c$]

  • $e$ is an item
  • $w(c,e,s)$ is the weight contributed by a website $s$
  • $AvgRank_{c,e,s}$ is the average rank of $e$ within all lists extracted from website $s$.

We only output qualified items by default in QDMiner.

  • qualified items: $S_{e|c} > 1$ and $S_{e|c} > \frac{|Sites(c)|}{10}$
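The item score and the qualification filter can be sketched as below. The per-site weight is an assumption: it is taken to be $1/\sqrt{AvgRank_{c,e,s}}$, by analogy with the document rank score, because the formula itself is not available in the text of these notes. Only the two thresholds come from the notes. The `rank_items` helper and its input format are hypothetical.

```python
import math
from collections import defaultdict


def rank_items(cluster):
    """Score the items of one dimension and keep only the qualified ones.

    cluster: iterable of (site, items) pairs, items given in list order.
    """
    positions = defaultdict(lambda: defaultdict(list))  # item -> site -> ranks
    sites = set()
    for site, items in cluster:
        sites.add(site)
        for pos, item in enumerate(items, start=1):
            positions[item][site].append(pos)

    scores = {}
    for item, per_site in positions.items():
        # Assumed per-site weight: 1 / sqrt(average rank of the item on that site).
        scores[item] = sum(1.0 / math.sqrt(sum(r) / len(r))
                           for r in per_site.values())

    # Qualified: score > 1 and score > |Sites(c)| / 10.
    threshold = max(1.0, len(sites) / 10)
    qualified = [(e, s) for e, s in scores.items() if s > threshold]
    return sorted(qualified, key=lambda pair: pair[1], reverse=True)


cluster = [
    ("a.com", ["Seiko", "Citizen"]),
    ("b.com", ["Seiko", "Casio"]),
    ("c.com", ["Citizen", "Seiko"]),
]
for item, score in rank_items(cluster):
    print(item, round(score, 3))
```

Under this assumed weight a single website contributes at most 1, so the first threshold requires an item to be supported by more than one website.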