Thesaurus and automatic term association
- The WordNet and HowNet resources discussed above rely on expert knowledge and extensive manual curation. Can similar word associations be generated automatically?
Definition: Two words are similar if they co-occur with similar words (an idea akin to word2vec). This reduces to a clustering problem (covered in the data-mining course; details are not repeated here).
Partitional clustering
The most typical example is k-means.
It follows the E-M algorithm: in the E step the parameters θ (the centroid positions) are held fixed while the hidden variable (which cluster each point belongs to) is inferred; in the M step the centroids are re-estimated from those assignments. Stopping conditions for k-means:
- A fixed number of iterations.
- Term partition unchanged.
- Centroid positions don’t change.
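The E/M alternation and the "partition unchanged" stopping condition above can be sketched in plain Python. This is a minimal illustration, assuming points are numeric tuples and using squared Euclidean distance; the function name and random initialization are illustrative choices, not from the source:

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means sketch: alternate E and M steps until the
    partition stops changing or a fixed iteration budget is spent."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # illustrative: init from random points
    assignment = None
    for _ in range(max_iters):
        # E step: with centroids (theta) fixed, assign each point to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        if new_assignment == assignment:  # stop: term partition unchanged
            break
        assignment = new_assignment
        # M step: re-estimate each centroid as the mean of its assigned points.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return assignment, centroids
```

Running it on two well-separated groups of 2-D points recovers the expected partition after a few iterations.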
Strengths and weaknesses of k-means:
- Simple and efficient: each iteration is linear in the number of objects.
- The number of clusters k must be specified in advance.
- Like E-M in general, it converges only to a local optimum and is sensitive to initialization.
Hierarchical clustering
- Bottom-Up (agglomerative):Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
- Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
Measures of inter-cluster similarity in hierarchical clustering:
- Single linkage (nearest neighbor): the distance between two clusters is determined by the distance of the two closest objects in the different clusters.
- Complete linkage (furthest neighbor):the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the “furthest neighbors”).
- Group average linkage:the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
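The three linkage criteria differ only in how they aggregate the pairwise distances between the two clusters. A minimal sketch, assuming `dist` is any pairwise distance function supplied by the caller (the function names are illustrative):

```python
def single_linkage(c1, c2, dist):
    """Nearest neighbor: distance between the two closest objects."""
    return min(dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2, dist):
    """Furthest neighbor: greatest distance between any two objects."""
    return max(dist(a, b) for a in c1 for b in c2)

def group_average_linkage(c1, c2, dist):
    """Average distance over all cross-cluster object pairs."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
```

For example, with 1-D clusters `[0, 1]` and `[4, 10]` and absolute difference as the distance, single linkage gives 3, complete linkage gives 10, and group-average linkage gives 6.5.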
Strengths and weaknesses of hierarchical clustering:
- No need to specify the number of clusters in advance.
- Hierarchal nature maps nicely onto human intuition for some domains
- They do not scale well: time complexity of at least O(n²), where n is the number of total objects.
- Like any heuristic search algorithm, local optima are a problem.
- Interpretation of results is (very) subjective
Word clustering applies these algorithms to the word-word similarity defined below, aiming for high intra-cluster similarity and low inter-cluster similarity.
Construction of term phrases
Here "重症流感药物" ("severe influenza drug") is treated as a single phrase.
Identifying phrases with a statistical method:
$$MI_{kh} = \frac{Pair_{kh}}{TF_k \cdot TF_h}$$
where k and h are words, Pair_kh is the number of times k and h occur adjacent to each other, and TF is the total term frequency.
Summary
Automatic indexing process capable of producing high-performance retrieval results:
- (1) Terms in the medium-frequency ranges with positive discrimination values are used as index terms directly without further transformation.
- (2) The broad high-frequency terms with negative discrimination values are either discarded or incorporated into phrases with low-frequency terms.
- (3) The narrow low-frequency terms with discrimination values close to zero are broadened by inclusion into thesaurus categories.
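The three-way policy above can be summarized as a simple decision rule on collection frequency. A sketch under the assumption that frequency thresholds stand in for the discrimination-value analysis; the thresholds `low` and `high` and the function name are illustrative, not from the source:

```python
def indexing_action(term, tf, low, high):
    """Illustrative three-way indexing policy keyed on a term's
    total frequency tf. `low` and `high` are assumed thresholds
    approximating the discrimination-value ranges in the text."""
    if tf < low:
        return "broaden via thesaurus class"   # narrow low-frequency term
    if tf > high:
        return "discard or form phrases"       # broad high-frequency term
    return "index directly"                    # medium-frequency term
```

Real systems would base this decision on discrimination values directly rather than raw frequency alone.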
Expand query:
Term ambiguity may introduce irrelevant statistically correlated terms.
e.g. “Apple computer” ->“Apple red fruit computer”
So, only expand query with terms that are similar to all terms in the query.
e.g. "fruit" is not added to "Apple computer" since it is far from "computer"; "fruit" is added to "apple pie" since "fruit" is close to both "apple" and "pie".
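The "similar to all query terms" rule is a simple conjunctive filter over candidate expansion terms. A minimal sketch, assuming a caller-supplied similarity function `sim` and an illustrative threshold (both are assumptions, not from the source):

```python
def expand_query(query_terms, candidates, sim, threshold=0.5):
    """Add a candidate term only if it is similar to *every* query
    term, so ambiguity in one term cannot pull in unrelated senses.
    `sim` and `threshold` are illustrative assumptions."""
    added = [c for c in candidates
             if all(sim(c, q) >= threshold for q in query_terms)]
    return query_terms + added
```

With a toy similarity where "fruit" is close to "apple" and "pie" but far from "computer", this reproduces the example above: "fruit" expands "apple pie" but not "Apple computer".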