Information Retrieval (6) -- Text Analysis and Automatic Indexing (Part 3)

Thesaurus and automatic term association
  1. WordNet and HowNet, discussed earlier, rely on expert knowledge and a large amount of manual curation. Can similar word associations be generated automatically?
    Definition: Two words are similar if they co-occur with similar words (an idea similar to word2vec).
  2. Clustering (covered in the data-mining course, so not discussed in depth here)
  • Partitional clustering
    The most typical algorithm is k-means (a minimal sketch follows after this list):
    [Figure: the k-means algorithm]
    This is the classic E-M idea: in the E-step the parameters θ (the cluster centroids) are held fixed and each point is assigned to its most likely cluster (the latent variable); in the M-step the centroids are re-estimated to best fit those assignments.

    Stopping conditions for k-means:

    • A fixed number of iterations.
    • Term partition unchanged.
    • Centroid positions don’t change.

    Advantages and disadvantages of k-means:
    [Figure: advantages and disadvantages of k-means]

  • Hierarchical clustering

    • Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
    • Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.

      Ways to measure inter-cluster similarity in hierarchical clustering:
    • Single linkage (nearest neighbor): the distance between two clusters is determined by the distance between the two closest objects in the different clusters.
    • Complete linkage (furthest neighbor): the distance between two clusters is determined by the greatest distance between any two objects in the different clusters (i.e., by the “furthest neighbors”).
    • Group average linkage: the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.

      Advantages and disadvantages of hierarchical clustering:
      • No need to specify the number of clusters in advance.
      • Hierarchical nature maps nicely onto human intuition for some domains.
      • They do not scale well: time complexity of at least O(n²), where n is the number of objects.
      • Like any heuristic search algorithm, local optima are a problem.
      • Interpretation of results is (very) subjective

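A minimal k-means sketch in Python, referenced from the k-means item above (the names and setup are illustrative, not from the lecture). The E-step assigns each point to the nearest centroid with the parameters held fixed, the M-step re-estimates the centroids from those assignments, and the loop stops under the conditions listed earlier:

```python
import random

def kmeans(points, k, max_iters=100):
    """Plain k-means on a list of equal-length numeric vectors."""
    centroids = random.sample(points, k)          # initialize with k existing points
    assignment = [None] * len(points)
    for _ in range(max_iters):                    # stop: fixed number of iterations
        # E-step: with the centroids (theta) fixed, assign each point
        # to the cluster whose centroid is closest (the latent variable).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:          # stop: partition unchanged
            break                                 # (centroid positions also stop changing here)
        assignment = new_assignment
        # M-step: re-estimate each centroid as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                           # keep the old centroid if a cluster empties
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids
```

For term clustering, `points` would be term vectors such as the co-occurrence vectors sketched below; having to pick k in advance is exactly where k-means contrasts with the hierarchical methods listed above.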
Term clustering groups terms by the term-term similarity shown below, aiming for high intra-cluster similarity and low inter-cluster similarity.
[Figure: term-term similarity matrix]
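Putting the pieces together, here is a sketch of term clustering under the definition above (the helper names, the toy corpus, and the use of SciPy's hierarchy module are assumptions, not the lecture's): each term gets a co-occurrence vector, cosine similarity between those vectors serves as the term-term similarity, and 1 − similarity is handed to group-average-linkage agglomerative clustering.

```python
from collections import Counter, defaultdict
from math import sqrt

from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cooccurrence_vectors(docs, window=2):
    """Map each term to a Counter of the terms seen within +/- `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in docs:
        for i, term in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy tokenized corpus, purely for illustration.
docs = [["apple", "pie", "recipe"], ["apple", "computer", "keyboard"],
        ["fruit", "pie", "recipe"], ["laptop", "computer", "keyboard"]]
vecs = cooccurrence_vectors(docs)
terms = sorted(vecs)

# Term-term distance matrix: two terms are similar if they co-occur with similar terms.
n = len(terms)
dist = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        dist[i][j] = dist[j][i] = 1.0 - cosine(vecs[terms[i]], vecs[terms[j]])

# Group-average-linkage hierarchical clustering, cut into two clusters.
labels = fcluster(linkage(squareform(dist), method="average"), t=2, criterion="maxclust")
print(dict(zip(terms, labels)))
```

Swapping `method="average"` for `"single"` or `"complete"` gives the single-linkage and complete-linkage variants described above.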

Construction of term phrases

[Figure: phrase construction example]
Here "重症流感药物" ("severe influenza drug") is treated as a single phrase.
Determining phrases with a statistical method:
$MI_{kh} = \frac{Pair_{kh}}{TF_k \cdot TF_h}$
where k and h are terms, $Pair_{kh}$ is the number of times k and h occur adjacent to each other, and $TF$ is the total term frequency.
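A small sketch of this statistic (the function name, corpus format, and cutoffs are assumptions): count adjacent pairs and total term frequencies over a tokenized corpus, score each pair with the formula above, and keep the top-scoring pairs as candidate phrases.

```python
from collections import Counter

def candidate_phrases(docs, min_pair_count=2, top_n=10):
    """Rank adjacent term pairs by MI_kh = Pair_kh / (TF_k * TF_h)."""
    tf = Counter()      # total term frequencies, TF_k
    pairs = Counter()   # counts of adjacent occurrences, Pair_kh
    for tokens in docs:
        tf.update(tokens)
        pairs.update(zip(tokens, tokens[1:]))   # every (k, h) with h right after k
    scored = {
        (k, h): count / (tf[k] * tf[h])
        for (k, h), count in pairs.items()
        if count >= min_pair_count              # drop very rare pairs
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

On a segmented Chinese corpus the tokens come from the word-segmentation step, so a pair such as ("重症", "流感") scoring highly could be the signal to index it as a phrase.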

Summary

Automatic indexing process capable of producing high-performance retrieval results:

  • (1) Terms in the medium-frequency ranges with positive discrimination values are used as index terms directly without further transformation.
  • (2) The broad high-frequency terms with negative discrimination values are either discarded or incorporated into phrases with low-frequency terms.
  • (3) The narrow low-frequency terms with discrimination values close to zero are broadened by inclusion into thesaurus categories.
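Read as a decision procedure, the three rules might look like the following sketch (the frequency thresholds and the near-zero tolerance are assumptions, not values given in the lecture):

```python
def index_action(term_freq, discrimination_value,
                 low_freq=5, high_freq=1000, eps=1e-3):
    """Decide how a term enters the index, following rules (1)-(3) above."""
    if low_freq <= term_freq <= high_freq and discrimination_value > 0:
        return "use directly as an index term"                              # rule (1)
    if term_freq > high_freq and discrimination_value < 0:
        return "discard, or combine into phrases with low-frequency terms"  # rule (2)
    if term_freq < low_freq and abs(discrimination_value) <= eps:
        return "broaden by mapping to a thesaurus category"                 # rule (3)
    return "leave unchanged"
```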

Query expansion:
[Figure: query expansion example]
Term ambiguity may introduce irrelevant statistically correlated terms.
e.g. “Apple computer” ->“Apple red fruit computer”
So, only expand the query with terms that are similar to all terms in the query.
e.g. “fruit” is not added to “Apple computer” since it is far from “computer”; “fruit” is added to “apple pie” since “fruit” is close to both “apple” and “pie”.
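A sketch of that rule (illustrative; `similarity` could be the co-occurrence cosine defined earlier, and the threshold is an assumption): a candidate term is added only when its minimum similarity to every query term clears the threshold, which keeps “fruit” out of “Apple computer” but can let it into “apple pie”.

```python
def expand_query(query_terms, candidates, similarity, threshold=0.3):
    """Add a candidate only if it is similar to *all* terms in the query."""
    added = [c for c in candidates
             if min(similarity(c, q) for q in query_terms) >= threshold]
    return list(query_terms) + added

# e.g. expand_query(["apple", "pie"], ["fruit"], sim) may add "fruit",
# while expand_query(["apple", "computer"], ["fruit"], sim) should not,
# because "fruit" is far from "computer".
```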
