Candidate Entity Ranking
Two kinds of ranking methods:
- Supervised ranking methods
- Unsupervised ranking methods
Features
Two kinds of features:
- context-independent features
Simply check whether the mention and the entity label in the KB match:
- exact matching
- Dice coefficient score
- Hamming distance
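The three string comparisons above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names are made up for these notes, the Dice score is computed over word sets (character bigrams are another common choice), and Hamming distance is only defined here for strings of equal length.

```python
def exact_match(mention: str, label: str) -> bool:
    """True when the two strings are identical after case folding."""
    return mention.casefold() == label.casefold()


def dice_coefficient(mention: str, label: str) -> float:
    """Dice score over word sets: 2 * |A & B| / (|A| + |B|)."""
    a = set(mention.casefold().split())
    b = set(label.casefold().split())
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    return 2 * len(a & b) / (len(a) + len(b))


def hamming_distance(mention: str, label: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(mention) != len(label):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(mention, label))
```

For example, `dice_coefficient("Michael Jordan", "Michael I. Jordan")` shares two of the five words in total, and `hamming_distance("karolin", "kathrin")` counts three differing positions.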
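- context-dependent features (these require reading the context around the mention)
entity popularity: pick the most common sense
entity type: NER can return the broad type of a given word (person, organisation, location…). Checking whether the mention's type matches the candidate entity's type helps determine the sense.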
bag of words (BOW)
take all words in the document that contains the entity mention and match them against the words associated with the entity
concept vectors
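Key-phrases, anchor text, and named entities can be extracted from the given document. These features are used to build vectors that represent the mention and the candidate entity. The similarity between the two vectors can be computed with cosine similarity or Jaccard similarity.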
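A minimal sketch of the two similarity measures, assuming the concept vectors are simple term counts. The helper names and the example terms are hypothetical and only serve to show the arithmetic.

```python
import math
from collections import Counter


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical concept terms for a mention's document and a candidate entity.
mention_terms = ["basketball", "chicago", "bulls", "nba"]
entity_terms = ["basketball", "nba", "bulls", "player", "chicago", "mvp"]

cos = cosine_similarity(Counter(mention_terms), Counter(entity_terms))
jac = jaccard_similarity(set(mention_terms), set(entity_terms))
```

Cosine works on weighted vectors, while Jaccard only looks at which terms are present, so the two can rank candidates differently.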
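coherence between mapping entities
Within one document, the entities are consistent with one or two topics.
Coherence can be measured by computing the relatedness between the candidate entities of two mentions. In Wikipedia, we can count how many articles link to both entities of a pair.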
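The Wikipedia-based count can be sketched as below, assuming we already have, for each entity, the set of articles that link to it. The `inlinks` data and both function names are hypothetical. Published relatedness measures usually normalise this raw count, which is not done here.

```python
def shared_inlinks(a: str, b: str, inlinks: dict) -> int:
    """Count the articles that link to both entity a and entity b."""
    return len(inlinks.get(a, set()) & inlinks.get(b, set()))


def most_coherent(candidates, context_entities, inlinks):
    """Pick the candidate sharing the most inlinking articles with the context."""
    return max(
        candidates,
        key=lambda c: sum(shared_inlinks(c, e, inlinks) for e in context_entities),
    )


# Hypothetical inlink data: entity -> set of article titles linking to it.
inlinks = {
    "Michael Jordan (basketball)": {"NBA", "Chicago Bulls", "Scottie Pippen"},
    "Michael I. Jordan (scientist)": {"Machine learning", "UC Berkeley"},
    "Chicago Bulls": {"NBA", "Scottie Pippen", "Chicago"},
}

best = most_coherent(
    ["Michael Jordan (basketball)", "Michael I. Jordan (scientist)"],
    ["Chicago Bulls"],
    inlinks,
)
```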
Supervised ranking methods
Binary classification methods
Given a <mention, entity> pair as input, we can train a classifier that returns 1 or 0 to decide whether the mapping is correct.
e.g. SVM, Naive Bayes classifiers
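To make the <mention, entity> → 1/0 setup concrete, here is a dependency-free sketch. It uses a perceptron as a stand-in for SVM or Naive Bayes, purely to keep the example short. The two features per pair (name similarity, context similarity) and the training values are invented for illustration.

```python
def score(weights, bias, x):
    """Linear score of one feature vector."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    """Return 1 if the <mention, entity> mapping is judged correct, else 0."""
    return 1 if score(weights, bias, x) > 0 else 0


def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Classic perceptron updates: adjust weights only on misclassified pairs."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            if error:
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
    return weights, bias


# Hypothetical features per pair: [name similarity, context similarity].
X = [[0.9, 0.8], [1.0, 0.7], [0.8, 0.1], [0.2, 0.3]]
y = [1, 1, 0, 0]  # 1 = correct mapping, 0 = incorrect mapping

weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, x) for x in X]
```

When several candidates of the same mention are classified as 1, an extra step is still needed to pick one of them, for example by taking the highest score.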
Probabilistic methods
Instead of a classifier, we can also use a probabilistic model to express how likely a mapping is to be correct.
Unsupervised ranking methods
Graph based approaches
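AIDA system: models mention-entity and entity-entity relations as a graph. Each edge carries a weight representing how likely that interpretation is.
Goal: find a subgraph that keeps exactly one mention-entity edge per mention and has maximum weight. This is NP-hard, so it is approximated with a greedy algorithm.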
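A simplified greedy sketch in this spirit: repeatedly drop the candidate entity with the lowest weighted degree, as long as every mention keeps at least one candidate. This is not claimed to reproduce AIDA itself. The function name, the graph encoding, and all weights are assumptions made for illustration.

```python
def greedy_disambiguate(mention_edges, entity_edges):
    """Greedily prune weak candidates until each mention keeps one entity.

    mention_edges: {mention: {entity: weight}}
    entity_edges:  {frozenset({entity_a, entity_b}): weight}
    """
    # Copy so the caller's candidate sets are left untouched.
    candidates = {m: dict(ents) for m, ents in mention_edges.items()}

    def weighted_degree(entity):
        # Sum of mention-entity weights plus entity-entity weights,
        # counting only edges whose endpoints are still in the graph.
        alive = {e for ents in candidates.values() for e in ents}
        degree = sum(ents.get(entity, 0.0) for ents in candidates.values())
        degree += sum(w for pair, w in entity_edges.items()
                      if entity in pair and pair <= alive)
        return degree

    while True:
        # Entities that are the last candidate of some mention must stay.
        sole = {e for ents in candidates.values() if len(ents) == 1 for e in ents}
        removable = {e for ents in candidates.values() if len(ents) > 1
                     for e in ents} - sole
        if not removable:
            break
        weakest = min(removable, key=weighted_degree)
        for ents in candidates.values():
            ents.pop(weakest, None)

    # If a mention still has several candidates, fall back to the heaviest edge.
    return {m: max(ents, key=ents.get) for m, ents in candidates.items() if ents}


# Hypothetical weights for two mentions in one sentence.
mention_edges = {
    "Jordan": {"Michael Jordan": 0.5, "Jordan (country)": 0.6},
    "Bulls": {"Chicago Bulls": 0.7, "Bull (animal)": 0.4},
}
entity_edges = {
    frozenset({"Michael Jordan", "Chicago Bulls"}): 0.9,
    frozenset({"Jordan (country)", "Bull (animal)"}): 0.1,
}
result = greedy_disambiguate(mention_edges, entity_edges)
```

In this toy graph the mention-entity weight alone would favour "Jordan (country)", but the strong entity-entity edge to "Chicago Bulls" keeps "Michael Jordan" in the subgraph.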
VSM based models (vector space model)
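Motivation: good training data is hard and expensive to obtain.
Only compute the similarity between the mention (with its context) and each candidate entity.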
Unlinkable mention prediction
ignore the problem.
if the candidate set is empty, assume the mention is unlinkable
use a threshold value on the ranking score
train a binary classifier
add NIL as a special entity. If NIL gets the highest score, the mention is considered unlinkable.
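Three of these strategies (empty candidate set, score threshold, NIL as a special entity) can be sketched together. The function names, the default threshold of 0.5, and the use of `None` to signal an unlinkable mention are assumptions for illustration only.

```python
def link_with_threshold(scores: dict, threshold: float = 0.5):
    """Return the best entity, or None when the mention is judged unlinkable."""
    if not scores:
        return None  # no candidates at all: assume unlinkable
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None  # best ranking score is too low to trust
    return best


def link_with_nil(scores: dict, nil_score: float):
    """Rank NIL alongside the real candidates; NIL winning means unlinkable."""
    ranked = dict(scores)
    ranked["NIL"] = nil_score
    best = max(ranked, key=ranked.get)
    return None if best == "NIL" else best
```

With a threshold the cut-off is a fixed number that has to be tuned, whereas the NIL entity lets the ranking model itself decide when no candidate fits.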