Note: first extract the ambiguous entities, then extract the context for each ambiguous entity (unambiguous entities do not need context).
disambiguations_preprocessing.py:
for line in f:
    temp = line.split("/resource/")
    if len(temp) != 3:  # skip malformed lines
        continue
    first = temp[1]
    endIndex = first.find(">")
    first = first[:endIndex]
    first = first.split("_")[0]
    second = temp[2]
    endIndex = second.rfind(">")
    second = second[:endIndex]
    print(first, "=>", second)
    if tempFirst != first:  # a new ambiguous entity: start a new output line
        tempFirst = first
        ff.write("\n" + first + "=>" + second.replace("_", " "))
    else:  # same entity as before: append another candidate meaning
        ff.write("<=" + second.replace("_", " "))
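The per-line parsing above can be isolated as a small function. This is only a sketch: the sample URIs are illustrative, and `parse_disambiguation_line` is a hypothetical helper, not part of the original script.

```python
def parse_disambiguation_line(line):
    """Extract (ambiguous entity, candidate meaning) from one N-Triples line."""
    temp = line.split("/resource/")
    if len(temp) != 3:  # malformed line: need both subject and object URIs
        return None
    first = temp[1]
    first = first[:first.find(">")]      # e.g. "Apple_(disambiguation)"
    first = first.split("_")[0]          # keep the surface form: "Apple"
    second = temp[2]
    second = second[:second.rfind(">")]  # e.g. "Apple_Inc."
    return first, second.replace("_", " ")

sample = ('<http://dbpedia.org/resource/Apple_(disambiguation)> '
          '<http://dbpedia.org/ontology/wikiPageDisambiguates> '
          '<http://dbpedia.org/resource/Apple_Inc.> .')
print(parse_disambiguation_line(sample))  # → ('Apple', 'Apple Inc.')
```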
long_abstracts_preprocessing.py:
Extract the context of each ambiguous entity, and at the same time collect all entities.
for line in f:
    temp = line.split("/resource/")
    if len(temp) == 1:  # skip malformed lines
        continue
    line = temp[1]
    endIndex = line.find(">")
    entity = line[:endIndex]
    entity = entity.replace("_", " ")
    fff.write(entity + "\n")  # record every entity
    if entity in ambiguationWordDict:  # only ambiguous entities need stored context
        startIndex = line.find("\"")
        endIndex = line.rfind("\"")
        rawContext = line[startIndex + 1:endIndex]
        contextWordDict = {}
        rawWords = jieba.cut(rawContext.lower())  # accurate mode by default
        for word in rawWords:
            if len(word) < 4:  # drop short tokens
                continue
            if word in swDict:  # drop stop words
                continue
            if word in contextWordDict:
                contextWordDict[word] += 1
            else:
                contextWordDict[word] = 1
        ff.write(entity + "=>")
        for key in contextWordDict:
            ff.write(key + ":" + str(contextWordDict[key]) + " ")
        ff.write("\n")
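The context-counting loop above can be sketched as a standalone function. Two assumptions to keep the example dependency-free: `jieba.cut` is replaced by a plain whitespace split, and the stop-word set passed in is illustrative.

```python
from collections import Counter

def count_context_words(raw_context, stop_words, min_len=4):
    """Count context words, dropping short tokens and stop words."""
    words = raw_context.lower().split()  # stand-in for jieba.cut
    return dict(Counter(w for w in words
                        if len(w) >= min_len and w not in stop_words))

ctx = "Apple Inc is an American multinational technology company"
print(count_context_words(ctx, stop_words={"american"}))
```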
Building the disambiguation-mapping index with Lucene and building the ambiguous-entity context index with Lucene are covered in the next post.
References:
[1] Mendes P N, Jakob M, García-Silva A, et al. DBpedia Spotlight: Shedding light on the web of documents[C]// Proceedings of the 7th International Conference on Semantic Systems. ACM, 2011: 1-8.
[2] Han X, Sun L. A generative entity-mention model for linking entities with knowledge base[C]// Proceedings of ACL, 2011: 945-954.
[4] http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
[5] http://wiki.dbpedia.org/Downloads2014
[6] http://www.oschina.net/p/jieba (jieba Chinese word segmentation)