Note: first extract the ambiguous entities, then extract the context for each ambiguous entity (unambiguous entities do not need context).
disambiguations_preprocessing.py:
for line in f:
    temp = line.split("/resource/")
    if len(temp) != 3:  # skip malformed lines
        continue
    first = temp[1]
    endIndex = first.find(">")
    first = first[:endIndex]
    first = first.split("_")[0]
    second = temp[2]
    endIndex = second.rfind(">")
    second = second[:endIndex]
    print(first, "=>", second)
    if tempFirst != first:  # a new ambiguous entity: start a new output line
        tempFirst = first
        ff.write("\n" + first + "=>" + second.replace("_", " "))
    else:  # same entity as before: append another candidate meaning
        ff.write("<=" + second.replace("_", " "))
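The per-line parsing above can be isolated as a small function. This is only a sketch: the sample URIs are illustrative, and `parse_disambiguation_line` is a hypothetical helper, not part of the original script.

```python
def parse_disambiguation_line(line):
    """Extract (ambiguous entity, candidate meaning) from one N-Triples line."""
    temp = line.split("/resource/")
    if len(temp) != 3:  # malformed line: need both subject and object URIs
        return None
    first = temp[1]
    first = first[:first.find(">")]      # e.g. "Apple_(disambiguation)"
    first = first.split("_")[0]          # keep the surface form: "Apple"
    second = temp[2]
    second = second[:second.rfind(">")]  # e.g. "Apple_Inc."
    return first, second.replace("_", " ")

sample = ('<http://dbpedia.org/resource/Apple_(disambiguation)> '
          '<http://dbpedia.org/ontology/wikiPageDisambiguates> '
          '<http://dbpedia.org/resource/Apple_Inc.> .')
print(parse_disambiguation_line(sample))  # → ('Apple', 'Apple Inc.')
```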
long_abstracts_preprocessing.py:
Extract the context of each ambiguous entity, and at the same time collect all entities.
for line in f:
    temp = line.split("/resource/")
    if len(temp) == 1:  # skip malformed lines
        continue
    line = temp[1]
    endIndex = line.find(">")
    entity = line[:endIndex]
    entity = entity.replace("_", " ")
    fff.write(entity + "\n")  # record every entity
    if entity in ambiguationWordDict:  # only ambiguous entities need stored context
        startIndex = line.find("\"")
        endIndex = line.rfind("\"")
        rawContext = line[startIndex + 1:endIndex]
        contextWordDict = {}
        rawWords = jieba.cut(rawContext.lower())  # accurate mode by default
        for word in rawWords:
            if len(word) < 4:  # drop short tokens
                continue
            if word in swDict:  # drop stop words
                continue
            if word in contextWordDict:
                contextWordDict[word] += 1
            else:
                contextWordDict[word] = 1
        ff.write(entity + "=>")
        for key in contextWordDict:
            ff.write(key + ":" + str(contextWordDict[key]) + " ")
        ff.write("\n")
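The context-counting loop above can be sketched as a standalone function. Two assumptions to keep the example dependency-free: `jieba.cut` is replaced by a plain whitespace split, and the stop-word set passed in is illustrative.

```python
from collections import Counter

def count_context_words(raw_context, stop_words, min_len=4):
    """Count context words, dropping short tokens and stop words."""
    words = raw_context.lower().split()  # stand-in for jieba.cut
    return dict(Counter(w for w in words
                        if len(w) >= min_len and w not in stop_words))

ctx = "Apple Inc is an American multinational technology company"
print(count_context_words(ctx, stop_words={"american"}))
```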
Building the disambiguation-mapping index with Lucene and building the ambiguous-entity context index with Lucene are covered in the next post.
References:
[1] Mendes P N, Jakob M, García-Silva A, et al. DBpedia Spotlight: Shedding light on the web of documents[C]// Proceedings of the 7th International Conference on Semantic Systems. ACM, 2011: 1-8.
[2] Han X, Sun L. A generative entity-mention model for linking entities with knowledge base[C]// Proceedings of ACL, 2011: 945-954.
[4] http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
[5] http://wiki.dbpedia.org/Downloads2014
[6] http://www.oschina.net/p/jieba (jieba Chinese word segmentation)