Thesaurus and automatic term association
- The WordNet and HowNet resources discussed above rely on expert knowledge and extensive manual curation. Can similar word associations be generated automatically?
Definition: Two words are similar if they co-occur with similar words (an idea akin to word2vec). This reduces to a clustering problem (covered in the data-mining course; details are not repeated here).
Partitional clustering
The most typical example is k-means.
It follows the E-M algorithm: in the E step the parameters θ (the centroid positions) are held fixed while the hidden variable (which cluster each point belongs to) is inferred; in the M step the centroids are re-estimated from those assignments. Stopping conditions for k-means:
- A fixed number of iterations.
- Term partition unchanged.
- Centroid positions don’t change.
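The E/M alternation and the "partition unchanged" stopping condition above can be sketched in plain Python. This is a minimal illustration, assuming points are numeric tuples and using squared Euclidean distance; the function name and random initialization are illustrative choices, not from the source:

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means sketch: alternate E and M steps until the
    partition stops changing or a fixed iteration budget is spent."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # illustrative: init from random points
    assignment = None
    for _ in range(max_iters):
        # E step: with centroids (theta) fixed, assign each point to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        if new_assignment == assignment:  # stop: term partition unchanged
            break
        assignment = new_assignment
        # M step: re-estimate each centroid as the mean of its assigned points.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return assignment, centroids
```

Running it on two well-separated groups of 2-D points recovers the expected partition after a few iterations.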
Strengths and weaknesses of k-means:
- Simple and efficient: each iteration is linear in the number of objects.
- The number of clusters k must be specified in advance.
- Like E-M in general, it converges only to a local optimum and is sensitive to initialization.
Hierarchical clustering
- Bottom-Up (agglomerative):Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
- Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
Measures of inter-cluster similarity in hierarchical clustering:
- Single linkage (nearest neighbor): the distance between two clusters is determined by the distance of the two closest objects in the different clusters.
- Complete linkage (furthest neighbor):the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the “furthest neighbors”).
- Group average linkage:the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
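The three linkage criteria differ only in how they aggregate the pairwise distances between the two clusters. A minimal sketch, assuming `dist` is any pairwise distance function supplied by the caller (the function names are illustrative):

```python
def single_linkage(c1, c2, dist):
    """Nearest neighbor: distance between the two closest objects."""
    return min(dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2, dist):
    """Furthest neighbor: greatest distance between any two objects."""
    return max(dist(a, b) for a in c1 for b in c2)

def group_average_linkage(c1, c2, dist):
    """Average distance over all cross-cluster object pairs."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
```

For example, with 1-D clusters `[0, 1]` and `[4, 10]` and absolute difference as the distance, single linkage gives 3, complete linkage gives 10, and group-average linkage gives 6.5.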
Strengths and weaknesses of hierarchical clustering:
- No need to specify the number of clusters in advance.
- Hierarchal nature maps nicely onto human intuition for some domains
- They do not scale well: time complexity of at least O(n²), where n is the number of total objects.
- Like any heuristic search algorithm, local optima are a problem.
- Interpretation of results is (very) subjective
Word clustering applies these algorithms to the word-word similarity defined below, aiming for high intra-cluster similarity and low inter-cluster similarity.
Construction of term phrases
Here "重症流感药物" ("severe influenza drug") is treated as a single phrase.
Identifying phrases with a statistical method:
$$MI_{kh} = \frac{Pair_{kh}}{TF_k \cdot TF_h}$$
where k and h are words, Pair_kh is the number of times k and h occur adjacent to each other, and TF is the total term frequency.
Summary
Automatic indexing process capable of producing high-performance retrieval results:
- (1) Terms in the medium-frequency ranges with positive discrimination values are used as index terms directly without further transformation.
- (2) The broad high-frequency terms with negative discrimination values are either discarded or incorporated into phrases with low-frequency terms.
- (3) The narrow low-frequency terms with discrimination values close to zero are broadened by inclusion into thesaurus categories.
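The three-way policy above can be summarized as a simple decision rule on collection frequency. A sketch under the assumption that frequency thresholds stand in for the discrimination-value analysis; the thresholds `low` and `high` and the function name are illustrative, not from the source:

```python
def indexing_action(term, tf, low, high):
    """Illustrative three-way indexing policy keyed on a term's
    total frequency tf. `low` and `high` are assumed thresholds
    approximating the discrimination-value ranges in the text."""
    if tf < low:
        return "broaden via thesaurus class"   # narrow low-frequency term
    if tf > high:
        return "discard or form phrases"       # broad high-frequency term
    return "index directly"                    # medium-frequency term
```

Real systems would base this decision on discrimination values directly rather than raw frequency alone.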
Expand query:
Term ambiguity may introduce irrelevant statistically correlated terms.
e.g. “Apple computer” ->“Apple red fruit computer”
So, only expand query with terms that are similar to all terms in the query.
e.g. "fruit" is not added to "Apple computer" since it is far from "computer"; "fruit" is added to "apple pie" since "fruit" is close to both "apple" and "pie".
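The "similar to all query terms" rule is a simple conjunctive filter over candidate expansion terms. A minimal sketch, assuming a caller-supplied similarity function `sim` and an illustrative threshold (both are assumptions, not from the source):

```python
def expand_query(query_terms, candidates, sim, threshold=0.5):
    """Add a candidate term only if it is similar to *every* query
    term, so ambiguity in one term cannot pull in unrelated senses.
    `sim` and `threshold` are illustrative assumptions."""
    added = [c for c in candidates
             if all(sim(c, q) >= threshold for q in query_terms)]
    return query_terms + added
```

With a toy similarity where "fruit" is close to "apple" and "pie" but far from "computer", this reproduces the example above: "fruit" expands "apple pie" but not "Apple computer".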