Article link: https://blog.csdn.net/zj71hmvx/article/details/122613549

Pedersen (2010): Information Content Measures of Semantic Similarity Perform Better without Sense Tagged Text
One week's reading material for anle; not a particularly hard read.

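This paper presents an empirical comparison of Information Content based similarity measures for pairs of concepts.
It shows that deriving Information Content from a modest amount of untagged text gives a higher correlation with human similarity judgments than deriving it from the largest available corpus of manually annotated sense-tagged text.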

Introduction

The simplest way to measure similarity is to use path lengths in WordNet, but this approach has limitations.
The most important one, in the paper's own words (this sentence took me ages to untangle):

Most significant is that path lengths between very specific concepts imply much smaller distinctions in semantic similarity than do comparable path lengths between very general concepts.

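Concepts can instead be augmented with information derived from raw, unannotated corpora. The key to this solution is not the particular type of corpus used, but increasing the number of concepts in WordNet that have frequency counts associated with them.
Information Content measures of semantic similarity can be improved significantly this way, without having to create a sense-tagged corpus (which is too expensive to build).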

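Information Content (IC) is a measure of how specific a concept is: the lower the value, the more frequently the concept occurs.
For each concept c in WordNet, IC is defined as the negative log of the probability of c.
IC can only be computed for nouns and verbs, because these are the only parts of speech that WordNet organizes into hierarchies, and the noun and verb hierarchies are separate from each other.
In the paper's words: More general concepts have lower information content than more specific concepts.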
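
A minimal sketch of the definition above, IC(c) = -log P(c), on a made-up five-concept hierarchy. The concept names, the counts, and the helper names (propagate_counts, information_content) are all hypothetical. It also assumes, following Resnik's formulation rather than anything stated in these notes, that a concept's count includes the counts of everything below it, so the root has probability 1.

```python
import math

# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "animal": "entity",
    "car": "vehicle",
    "scooter": "vehicle",
}

# Made-up corpus counts for occurrences of each concept itself.
RAW_COUNTS = {"entity": 0, "vehicle": 2, "animal": 4, "car": 5, "scooter": 1}


def propagate_counts(parent, raw_counts):
    """Add each concept's count to itself and to all of its ancestors."""
    totals = {concept: 0 for concept in parent}
    for concept, count in raw_counts.items():
        node = concept
        while node is not None:
            totals[node] += count
            node = parent[node]
    return totals


def information_content(parent, raw_counts):
    """Return IC(c) = -log P(c), with P(c) = count(c) / count(root).

    Concepts whose propagated count is zero are left out, because the
    logarithm of zero is undefined.
    """
    totals = propagate_counts(parent, raw_counts)
    root = next(c for c, p in parent.items() if p is None)
    n = totals[root]
    return {c: -math.log(t / n) for c, t in totals.items() if t > 0}


ic = information_content(PARENT, RAW_COUNTS)
for concept in sorted(ic, key=ic.get):
    print(f"{concept:8s} {ic[concept]:.3f}")
```

On this toy data the root gets an IC of 0, and the rarely seen scooter gets the highest value, which matches the statement that more general concepts have lower information content.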

Semantic similarity measures

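WordNet::Similarity includes three Information Content measures of semantic similarity: (res) (Resnik, 1995), (jcn) (Jiang and Conrath, 1997), and (lin) (Lin, 1998).
All three rely on the notion of a least common subsumer (LCS), the most specific concept that is an ancestor of both concepts; for example, the LCS of car and scooter is vehicle.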

The Resnik (res) measure simply uses the Information Content of the LCS as the similarity value; lin and jcn both refine res.
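
To make the three measures concrete, here is a sketch using their standard published formulas, which these notes do not spell out: res is IC(LCS), lin is 2 * IC(LCS) / (IC(c1) + IC(c2)), and jcn is 1 / (IC(c1) + IC(c2) - 2 * IC(LCS)). The hierarchy and the IC values are made up, the function names are hypothetical, and the handling of a zero denominator is a simplification, not a description of how WordNet::Similarity treats that case.

```python
# Hypothetical toy hierarchy (child -> parent) and made-up IC values.
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "animal": "entity",
    "car": "vehicle",
    "scooter": "vehicle",
}
IC = {"entity": 0.0, "vehicle": 0.4, "animal": 1.1, "car": 0.9, "scooter": 2.5}


def ancestors(concept):
    """Return the concept followed by its ancestors, up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def lcs(c1, c2):
    """Least common subsumer in a tree: the deepest shared ancestor."""
    seen = set(ancestors(c1))
    return next(c for c in ancestors(c2) if c in seen)


def res(c1, c2):
    """Resnik: the IC of the LCS."""
    return IC[lcs(c1, c2)]


def lin(c1, c2):
    """Lin: IC of the LCS scaled by the IC of the two concepts."""
    denom = IC[c1] + IC[c2]
    return 2 * IC[lcs(c1, c2)] / denom if denom > 0 else 0.0


def jcn(c1, c2):
    """Jiang and Conrath: inverse of an IC-based distance."""
    distance = IC[c1] + IC[c2] - 2 * IC[lcs(c1, c2)]
    # Simplification: identical concepts have zero distance.
    return 1 / distance if distance > 0 else float("inf")


print(lcs("car", "scooter"), res("car", "scooter"),
      lin("car", "scooter"), jcn("car", "scooter"))
```

In real WordNet a concept can have more than one parent, so a pair may have several common subsumers; the single-parent tree here sidesteps that.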

Experimental data

This paper compares the ranking of pairs of concepts according to Information Content measures in WordNet::Similarity with a number of manually created gold standards.
These include the (RG) (Rubenstein and Goodenough, 1965) collection of 65 noun pairs, the (MC) (Miller and Charles, 1991) collection of 30 noun pairs (a subset of RG), and the (WS) WordSimilarity-353 collection of 353 pairs (Finkelstein et al., 2002).
RG and MC have been scored for similarity, while WS is scored for relatedness, which is a more general and less well-defined notion than similarity. For example, aspirin and headache are clearly related, but they aren’t really similar.
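
Comparing a measure's ranking of pairs with a gold standard comes down to a rank correlation. The sketch below assumes Spearman's rank correlation, which these notes do not name explicitly, and computes it in plain Python as the Pearson correlation of the ranks. The two score lists are invented for illustration and are not taken from RG, MC, or WS.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up scores for four word pairs (not real gold-standard data).
human_scores = [3.9, 3.0, 1.2, 0.4]
measure_scores = [0.9, 0.5, 0.6, 0.1]
rho = spearman(human_scores, measure_scores)
print(round(rho, 3))
```

Only the ordering of the scores matters here, which is why measures on very different scales, such as res and jcn, can be compared against the same human judgments.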

Results

This paper shows that semantic similarity measures based on Information Content can be significantly improved by increasing the coverage of the frequency counts used to derive Information Content. Increased coverage can come from unannotated text or simply assigning counts to every concept in WordNet and does not require sense-tagged text.
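
A small illustration of what coverage means here. One reading of "assigning counts to every concept" is to add 1 to every concept's count; that reading, the concept names, the counts, and the helper names are all assumptions for this sketch. For brevity it treats the concepts as a flat list and skips the propagation of counts up the hierarchy.

```python
import math

# Made-up counts: two of the four concepts never occur in the corpus.
RAW_COUNTS = {"car": 5, "scooter": 0, "bicycle": 0, "truck": 3}


def coverage(counts):
    """Fraction of concepts that have a non-zero count."""
    return sum(1 for n in counts.values() if n > 0) / len(counts)


def add_one(counts):
    """Give every concept a count by adding 1 to each."""
    return {concept: n + 1 for concept, n in counts.items()}


def flat_ic(counts):
    """IC = -log(count / total), only for concepts that have a count."""
    total = sum(counts.values())
    return {c: -math.log(n / total) for c, n in counts.items() if n > 0}


smoothed = add_one(RAW_COUNTS)
print(coverage(RAW_COUNTS), coverage(smoothed))
print(sorted(flat_ic(RAW_COUNTS)), sorted(flat_ic(smoothed)))
```

Without a count, a concept has no IC value, so any pair that involves it cannot be scored; after adding 1 everywhere, every concept gets a value.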