The previous post used Lucene to build two indexes: one mapping ambiguous surface forms to entities, and one over the contexts of ambiguous entities.
What is still missing is an index of the entities themselves; without it, there is nothing to look entities up against!
LingPipe is a natural fit for entity recognition and supports several approaches; see the official site for details (the link is given at the end of this post).
Building the entity dictionary with LingPipe needs little explanation, so here is the code:
// entityDictionaryChunkerFF is a static ExactDictionaryChunker field (LingPipe)
// Requires: com.aliasi.dict.MapDictionary, com.aliasi.dict.DictionaryEntry,
//           com.aliasi.chunk.ExactDictionaryChunker,
//           com.aliasi.tokenizer.IndoEuropeanTokenizerFactory,
//           java.io.BufferedReader, java.io.FileReader
// Index all entities
public static void BuildEntityDictionary() throws Exception
{
    double CHUNK_SCORE = 1.0;
    // String entityPath = "E:/LuceneDocument/long_abstracts_preprocessing_entity(file_contents_examples).txt";
    String entityPath = "E:/LuceneDocument/long_abstracts_preprocessing_entity.txt";
    MapDictionary<String> dictionary = new MapDictionary<String>();
    BufferedReader br = new BufferedReader(new FileReader(entityPath));
    String entity;
    int i = 0;
    while ((entity = br.readLine()) != null)
    {
        i++;
        if (i > 500000) // 4.63 million entities in total; only the first 500,000 are loaded here
        {
            break;
        }
        System.out.println(i + "=>" + entity);
        dictionary.addEntry(new DictionaryEntry<String>(entity, "DBpedia_entity", CHUNK_SCORE));
    }
    br.close();
    entityDictionaryChunkerFF = new ExactDictionaryChunker(dictionary,
            IndoEuropeanTokenizerFactory.INSTANCE,
            false,  // returnAllMatches = false
            false); // caseSensitive = false
    // For the difference between these flags, see
    // http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
    // The FF (false, false) chunker recognizes "German Empire", but the TF one does not
    System.out.println("dictionary size:\n" + dictionary.size());
}
With the entity index in place, we can move on to entity linking; see the next post.
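As a quick sanity check, the chunker built above can be applied to raw text to pull out entity mentions. The sketch below follows the same `ExactDictionaryChunker` setup, but substitutes a tiny in-memory dictionary for the full DBpedia entity file; the class name `ChunkerDemo` and the sample entries are illustrative, not from the original post:

```java
import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;
import com.aliasi.chunk.ExactDictionaryChunker;
import com.aliasi.dict.DictionaryEntry;
import com.aliasi.dict.MapDictionary;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

public class ChunkerDemo {
    public static void main(String[] args) {
        // Tiny in-memory dictionary standing in for the 500,000-entry one above
        MapDictionary<String> dictionary = new MapDictionary<String>();
        dictionary.addEntry(new DictionaryEntry<String>("German Empire", "DBpedia_entity", 1.0));
        dictionary.addEntry(new DictionaryEntry<String>("Berlin", "DBpedia_entity", 1.0));

        // Same flags as entityDictionaryChunkerFF: longest match only, case-insensitive
        ExactDictionaryChunker chunker = new ExactDictionaryChunker(
                dictionary, IndoEuropeanTokenizerFactory.INSTANCE, false, false);

        String text = "Berlin was the capital of the German Empire.";
        Chunking chunking = chunker.chunk(text);
        for (Chunk chunk : chunking.chunkSet()) {
            // Each chunk carries character offsets into the input text plus its type
            String mention = text.substring(chunk.start(), chunk.end());
            System.out.println(mention + " => " + chunk.type());
        }
    }
}
```

Each matched mention is printed with its dictionary type (`DBpedia_entity`), which is exactly the lookup that entity linking starts from.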
References:
[1] Mendes P N, Jakob M, García-Silva A, et al. DBpedia Spotlight: Shedding light on the web of documents[C]// Proceedings of the 7th International Conference on Semantic Systems. ACM, 2011: 1-8.
[2] Han X, Sun L. A generative entity-mention model for linking entities with knowledge base[C]// Proceedings of ACL, 2011: 945-954.
[3] http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
[4] http://wiki.dbpedia.org/Downloads2014
[5] http://www.oschina.net/p/jieba (jieba Chinese word segmentation)