Relation Extraction After Named Entity Recognition, and Knowledge Graph Expansion (Step 1: Dataset Construction)

Project scenario:

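I will not cover how the carbonate platform knowledge graph was built; the methods are described in detail in my earlier posts. This post picks up from the completed named entity recognition (NER) and moves on to relation extraction, with the goal of back-filling the original knowledge graph. After experts review the results, the dictionary is extended, NER is rerun, relation extraction is rerun, and the graph is back-filled again, until it is eventually complete. This is what the literature calls human in the loop, and it is the second step of the overall project.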


Problem description:

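The problem is still the same: I lack a dataset. The improvement over last time is that I now have many documents with recognized named entities, plus intermediate text (sentences that contain named entities). Only some engineering work is needed, which makes things much simpler.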


Cause analysis:

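Relation extraction here needs a change of mindset: we do not need relations between specific terms, only relations between keywords, that is, between entity categories. In other words, I do not need to know whether it is A Formation, B Formation, or AB Formation that corresponds to a given substance. Why? Because we already have a hierarchy of concrete substances, as shown in the figure. I only need to know that Formation and a substance stand in a parent-child relation; the specific nouns are not needed. Once the relation is determined, we run entity extraction on that sentence. Why do it in this order, rather than entity extraction first and relation extraction second? This is specific to this project: it reduces data redundancy and computation time. The next step is to build the data.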
[Figure: hierarchy of entity categories (image not available)]
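The category-level idea above can be sketched as a small lookup. This is only an illustration under assumptions: the entity names, the category names `Formation` and `Substance`, the relation label `parent-child`, and the function `category_relation` are all hypothetical and are not taken from the project's actual dictionaries.

```python
# Hypothetical toy dictionary: concrete entity -> category (keyword).
ENTITY_CATEGORY = {
    "A Formation": "Formation",
    "B Formation": "Formation",
    "AB Formation": "Formation",
    "Cement": "Substance",
}

# Hypothetical category-level relations: (parent category, child category) -> label.
CATEGORY_RELATION = {
    ("Formation", "Substance"): "parent-child",
}


def category_relation(first_entity, second_entity):
    """Look up the relation between the categories of two entities.

    Returns None when either entity is unknown or the category pair
    has no recorded relation.
    """
    try:
        pair = (ENTITY_CATEGORY[first_entity], ENTITY_CATEGORY[second_entity])
    except KeyError:
        return None
    return CATEGORY_RELATION.get(pair)


# Every concrete formation resolves to the same category-level relation.
for formation in ("A Formation", "B Formation", "AB Formation"):
    print(formation, "->", category_relation(formation, "Cement"))
```

Because the relation is stored once per category pair, adding a new formation name only requires one new dictionary entry, not a new relation.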
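Following this idea, building the dataset is simple: a basic hierarchical classification of the dictionary tells us which classes are in a parent-child relation. If the parent class contains M entities and the child class contains N entities, then M1 paired with N3 is still a parent-child relation, so we place these entities into a fixed template. Someone may ask: can a pair of entities have more than one meaning? No. If that were possible, such a naive fixed-template approach would not be used. The template is only a way to save time and train a usable model, and it can be extended later once there are more entity relations.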


Code implementation:

Taking the relation between time and substance as an example, the code for building the dataset is as follows:

def load_terms(path):
    # Read one term per line, stripping whitespace and skipping blank lines
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Load the time dictionary and the substance dictionary
tim = load_terms("./duc.txt")
sub = load_terms("./geosubstance1.txt")

# Pair every substance with every time term and fill the fixed template
Result = []
for s in sub:
    for t in tim:
        Result.append(s + ' | ' + t + ' | unknown | ' + t + ' foreland flexure been also accommodated ' + s)

for line in Result:
    print(line)

# Write one sample per line; the ./middle directory must already exist
with open('./middle/general.txt', "w", encoding='utf-8') as f:
    for line in Result:
        f.write(line + '\n')
print("Saved successfully")
print("okA")
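The script above handles one pair of dictionaries. A minimal sketch of the same logic as a reusable function is given here, so that it can be rerun for other parent/child dictionaries. The function name `build_samples` and its parameters are hypothetical; the default template and the `unknown` label are copied from the script above.

```python
from itertools import product

# Template copied from the script above; {t} is the time term, {s} the substance.
DEFAULT_TEMPLATE = "{t} foreland flexure been also accommodated {s}"


def build_samples(subs, times, relation="unknown", template=DEFAULT_TEMPLATE):
    """Return one 'substance | time | relation | sentence' line per pair.

    Pairs are produced with the substance list as the outer loop and the
    time list as the inner loop, the same order as the original script.
    """
    return [
        f"{s} | {t} | {relation} | " + template.format(t=t, s=s)
        for s, t in product(subs, times)
    ]


# Small in-memory example instead of reading the dictionary files.
for line in build_samples(["Cement", "Acicular"], ["Cretaceous", "Aptian"]):
    print(line)
```

Passing a different `template` or `relation` per dictionary pair keeps the generation step in one place while the input lists change.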

After repeatedly swapping in different parent/child data, part of the final relation extraction dataset looks like this:

Cement | Cretaceous | unknown | Cretaceous foreland flexure been also accommodated Cement
Cement | Berriasian | unknown | Berriasian foreland flexure been also accommodated Cement
Cement | Valanginian | unknown | Valanginian foreland flexure been also accommodated Cement
Cement | Hauterivian | unknown | Hauterivian foreland flexure been also accommodated Cement
Cement | Barremian | unknown | Barremian foreland flexure been also accommodated Cement
Cement | Aptian | unknown | Aptian foreland flexure been also accommodated Cement
Cement | Albian | unknown | Albian foreland flexure been also accommodated Cement
Cement | Cenomanian | unknown | Cenomanian foreland flexure been also accommodated Cement
Cement | Turonian | unknown | Turonian foreland flexure been also accommodated Cement
Cement | Coniacian | unknown | Coniacian foreland flexure been also accommodated Cement
Cement | Santonian | unknown | Santonian foreland flexure been also accommodated Cement
Cement | Campanian | unknown | Campanian foreland flexure been also accommodated Cement
Cement | Maastrichtian | unknown | Maastrichtian foreland flexure been also accommodated Cement
Acicular | Cretaceous | unknown | Cretaceous foreland flexure been also accommodated Acicular
Acicular | Berriasian | unknown | Berriasian foreland flexure been also accommodated Acicular
Acicular | Valanginian | unknown | Valanginian foreland flexure been also accommodated Acicular
Acicular | Hauterivian | unknown | Hauterivian foreland flexure been also accommodated Acicular
Acicular | Barremian | unknown | Barremian foreland flexure been also accommodated Acicular
Acicular | Aptian | unknown | Aptian foreland flexure been also accommodated Acicular
Acicular | Albian | unknown | Albian foreland flexure been also accommodated Acicular
Acicular | Cenomanian | unknown | Cenomanian foreland flexure been also accommodated Acicular
Acicular | Turonian | unknown | Turonian foreland flexure been also accommodated Acicular
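For later steps it can help to read these lines back into structured records. The sketch below assumes the four-column layout shown above, separated by ` | `; the `Sample` type, its field names, and `parse_line` are hypothetical helpers, not part of the original code.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    # Field names are assumptions based on the column order shown above.
    substance: str
    time: str
    relation: str
    sentence: str


def parse_line(line):
    """Split one dataset line into its four ' | '-separated fields."""
    # maxsplit=3 keeps any later ' | ' inside the sentence field intact.
    parts = [part.strip() for part in line.strip().split(" | ", 3)]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    return Sample(*parts)


sample = parse_line(
    "Cement | Aptian | unknown | Aptian foreland flexure been also accommodated Cement"
)
print(sample.substance, sample.time, sample.relation)
```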

That completes the first step. Next comes model training; in the next post the code and data will be uploaded to GitHub together.
