Rule-Embedding-Based Paper Comparison System (11) - Prediction Input Processing + Model Refinement

qq_43665502

Article link: https://blog.csdn.net/qq_43665502/article/details/106909775

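Prediction input processing + model refinement

Processing the prediction input
- Given the paperIds of the two input papers and the selected subspace, output the serialized representation of each paper in that subspace
Model refinement

Processing the prediction input

Given the paperIds of the two input papers and the selected subspace, output the serialized representation of each paper in that subspace

This function appeared in an earlier post; I have revised it here. Not every paper has a corresponding sentence in every subspace, and that caused an index error during later testing, so the function now handles the missing case.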

import numpy as np
from keras.preprocessing.sequence import pad_sequences  # 2020-era Keras path; adjust for your version

def test2sequence(firstId,secondId,SubSpace_dict):
    # index_list and word_index are the globals built in the earlier posts
    subspace_keys=SubSpace_dict.keys()  # use the dict passed in, not the global SubSpace0_dict
    if firstId in subspace_keys:
        temp_str1=SubSpace_dict[firstId]
        temp_list_word1=temp_str1.split(" ")
        temp_list_index1=[]
        for i in temp_list_word1:
            if i in index_list:  # skip words that are not in the vocabulary
                temp_list_index1.append(word_index[i])
        pad_array_first = pad_sequences([temp_list_index1], maxlen=150)
    else:
        # the paper has no sentence in this subspace: fall back to all zeros
        pad_array_first=np.array([[0]*150])
    if secondId in subspace_keys:
        temp_str2=SubSpace_dict[secondId]
        temp_list_word2=temp_str2.split(" ")
        temp_list_index2=[]
        for i in temp_list_word2:
            if i in index_list:
                temp_list_index2.append(word_index[i])
        pad_array_second = pad_sequences([temp_list_index2], maxlen=150)
    else:
        pad_array_second=np.array([[0]*150])
    return pad_array_first,pad_array_second
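
To make the shapes concrete, here is a minimal sketch of the same lookup-and-pad logic. It is self-contained, so `pad_pre` is a hypothetical NumPy stand-in for Keras `pad_sequences` with its default pre-padding and pre-truncation, `to_sequence` is an illustrative name, and the tiny `word_index` and `subspace` dictionaries are invented. It also checks membership in `word_index` directly rather than in the separate `index_list` used above.

```python
import numpy as np

MAXLEN = 150  # same sequence length as in test2sequence

def pad_pre(seq, maxlen=MAXLEN):
    """Stand-in for pad_sequences([seq], maxlen): left-pad with zeros, keep the last maxlen items."""
    seq = list(seq)[-maxlen:]
    out = np.zeros((1, maxlen), dtype=np.int32)
    if seq:
        out[0, maxlen - len(seq):] = seq
    return out

def to_sequence(paper_id, subspace, word_index):
    """Return a (1, MAXLEN) index array; all zeros if the paper has no sentence in this subspace."""
    sentence = subspace.get(paper_id)
    if sentence is None:
        return np.zeros((1, MAXLEN), dtype=np.int32)
    # Words missing from the vocabulary are dropped, as in test2sequence.
    indices = [word_index[w] for w in sentence.split(" ") if w in word_index]
    return pad_pre(indices)

# Toy data, invented for illustration only.
word_index = {"graph": 1, "neural": 2, "network": 3}
subspace = {102: "graph neural network model"}

present = to_sequence(102, subspace, word_index)
missing = to_sequence(999, subspace, word_index)
print(present.shape, present[0, -3:].tolist())
print(missing.shape, int(missing.sum()))
```

Both calls return an array of the same shape, so the downstream model input can be stacked without special-casing papers that lack a sentence in the chosen subspace.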

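Model refinement

Handling the rule features

Here I wrote a function that calls the rule-processing functions written by several teammates. Its input is the paperIds of two papers, and its output is a list of rule values.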

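def rulesEmbbeding(firstId,secondId):
    # each rule function is provided by a teammate and takes the two paperIds
    firstRule=referenceJaccard(firstId,secondId)          # Jaccard over references
    secondRule=PaperId2KeywordsJaccard(firstId,secondId)  # Jaccard over keywords
    thirdRule=ccsSimilarity(firstId,secondId)             # CCS similarity
    fourthRule=textJaccard(firstId,secondId)              # Jaccard over text
    FourRules=[firstRule,secondRule,thirdRule,fourthRule]
    return FourRules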
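
The rule functions themselves are not shown in this post. As a rough idea of what a Jaccard-style rule computes, the sketch below uses a hypothetical `jaccard` helper and invented reference lists; it is an illustration under those assumptions, not the teammates' actual implementation.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections; 0.0 when both are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Invented reference lists for two papers (illustration only).
refs_first = ["r1", "r2", "r3"]
refs_second = ["r2", "r3", "r4", "r5"]

score = jaccard(refs_first, refs_second)
print(score)  # 2 shared items out of 5 distinct items
```

Returning 0.0 for two empty collections avoids a division by zero when neither paper has any data for that rule.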

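Model prediction

Here I moved the prediction step out of the MyModel class. The function should take a list of papers as input and return the list of paper pairs that are similar in a given subspace, so I extended it accordingly.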

import itertools
import numpy as np
from keras.models import load_model

def predmodel(modelname,PaperIdList,SubSpace_dict):
    # every unordered pair of the input papers
    AllPaperPairs=list(itertools.combinations(PaperIdList, 2))
    first_list=[]
    second_list=[]
    FourRules_list=[]
    for each in AllPaperPairs:
        pad_array_first,pad_array_second=test2sequence(each[0],each[1],SubSpace_dict)
        a=pad_array_first.tolist()[0]
        b=pad_array_second.tolist()[0]
        c=rulesEmbbeding(each[0],each[1])
        first_list.append(a)
        second_list.append(b)
        FourRules_list.append(c)
    # stack the per-pair inputs into three arrays, one row per pair
    index_pad_array_first=np.array(first_list)
    index_pad_array_second=np.array(second_list)
    FourRules=np.array(FourRules_list)
    model=load_model(modelname)
    predlabel =model.predict([index_pad_array_first, index_pad_array_second, FourRules],
                                           batch_size=512, verbose=1)
    predlabel_list=predlabel.tolist()
    finalresult=[]
    # column 0 is the positive class (training labels use [1,0] for positive pairs)
    for i in range(0,len(AllPaperPairs)):
        if predlabel_list[i][0]>0.7:
            finalresult.append(AllPaperPairs[i])
    return finalresult

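Notes:

AllPaperPairs=list(itertools.combinations(PaperIdList, 2))
# this line generates every 2-element combination of the input paper list

In addition, the threshold is provisionally set so that a pair is judged similar in the subspace when the positive-class probability exceeds 0.7.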
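
The pairing and thresholding steps can be tried in isolation. The sketch below uses a three-paper list and made-up probabilities in place of real model output; the names `paper_ids`, `probs`, and `THRESHOLD` are chosen for illustration.

```python
import itertools
import numpy as np

paper_ids = [102, 114, 156]  # a short list, for illustration
# n papers give n*(n-1)/2 unordered pairs, emitted in input order
pairs = list(itertools.combinations(paper_ids, 2))

# Made-up model output: one [P(similar), P(not similar)] row per pair.
probs = np.array([[0.91, 0.09], [0.40, 0.60], [0.70, 0.30]])

THRESHOLD = 0.7
# Strict comparison, as in predmodel: a probability of exactly 0.7 is not kept.
similar = [pair for pair, p in zip(pairs, probs) if p[0] > THRESHOLD]
print(pairs)
print(similar)
```

Because the number of pairs grows quadratically with the list length, a long `PaperIdList` means many more rows for the model to score.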

Model training

Code

##-----Subspace 0-------#############---Model training------############################################################################################
index_pad_array0_first,index_pad_array0_second=sample2sequence(Max100_0_pos_list,Min100_0_neg_list,SubSpace0_dict)
FourRules0=[]
for i in range(0,100):
    temp_list=rulesEmbbeding(Max100_0_pos_list[i][0],Max100_0_pos_list[i][1])
    FourRules0.append(temp_list)   
for i in range(0,100):
    temp_list=rulesEmbbeding(Min100_0_neg_list[i][0],Min100_0_neg_list[i][1])  # both ids come from the negative pair
    FourRules0.append(temp_list)   
FourRules0=np.array(FourRules0)
pos_list=[[1,0]]*100
neg_list=[[0,1]]*100
y0=pos_list+neg_list
model0=MyModel(batch_size=128, num_epochs=config.NUM_EPOCHES, word_index=word_index, subId=0,
                 index_pad_array_first=index_pad_array0_first, index_pad_array_second=index_pad_array0_second,FourRules=FourRules0,y=y0)
model0.trainmodel()
##-----Subspace 1--------#################---Model training------############################################################################################
index_pad_array1_first,index_pad_array1_second=sample2sequence(Max100_1_pos_list,Min100_1_neg_list,SubSpace1_dict)
FourRules1=[]
for i in range(0,100):
    temp_list=rulesEmbbeding(Max100_1_pos_list[i][0],Max100_1_pos_list[i][1])
    FourRules1.append(temp_list)  
for i in range(0,100):
    temp_list=rulesEmbbeding(Min100_1_neg_list[i][0],Min100_1_neg_list[i][1])  # both ids come from the negative pair
    FourRules1.append(temp_list)
FourRules1=np.array(FourRules1)
pos_list=[[1,0]]*100
neg_list=[[0,1]]*100
y1=pos_list+neg_list
model1=MyModel(batch_size=128, num_epochs=config.NUM_EPOCHES, word_index=word_index, subId=1,
                 index_pad_array_first=index_pad_array1_first, index_pad_array_second=index_pad_array1_second,FourRules=FourRules1,y=y1)
model1.trainmodel()
##-----Subspace 2--------#################---Model training------############################################################################################
index_pad_array2_first,index_pad_array2_second=sample2sequence(Max100_2_pos_list,Min100_2_neg_list,SubSpace2_dict)
FourRules2=[]
for i in range(0,100):
    temp_list=rulesEmbbeding(Max100_2_pos_list[i][0],Max100_2_pos_list[i][1])
    FourRules2.append(temp_list)  
for i in range(0,100):
    temp_list=rulesEmbbeding(Min100_2_neg_list[i][0],Min100_2_neg_list[i][1])  # both ids come from the negative pair
    FourRules2.append(temp_list)
FourRules2=np.array(FourRules2)
pos_list=[[1,0]]*100
neg_list=[[0,1]]*100
y2=pos_list+neg_list
model2=MyModel(batch_size=128, num_epochs=config.NUM_EPOCHES, word_index=word_index, subId=2,
                 index_pad_array_first=index_pad_array2_first, index_pad_array_second=index_pad_array2_second,FourRules=FourRules2,y=y2)
model2.trainmodel()
##-----Subspace 3--------#################---Model training------############################################################################################
index_pad_array3_first,index_pad_array3_second=sample2sequence(Max100_3_pos_list,Min100_3_neg_list,SubSpace3_dict)
FourRules3=[]
for i in range(0,100):
    temp_list=rulesEmbbeding(Max100_3_pos_list[i][0],Max100_3_pos_list[i][1])
    FourRules3.append(temp_list) 
for i in range(0,100):
    temp_list=rulesEmbbeding(Min100_3_neg_list[i][0],Min100_3_neg_list[i][1])  # both ids come from the negative pair
    FourRules3.append(temp_list)
FourRules3=np.array(FourRules3)
pos_list=[[1,0]]*100
neg_list=[[0,1]]*100
y3=pos_list+neg_list
model3=MyModel(batch_size=128, num_epochs=config.NUM_EPOCHES, word_index=word_index, subId=3,
                 index_pad_array_first=index_pad_array3_first, index_pad_array_second=index_pad_array3_second,FourRules=FourRules3,y=y3)
model3.trainmodel()
##-----Subspace 4--------#################---Model training------############################################################################################
index_pad_array4_first,index_pad_array4_second=sample2sequence(Max100_4_pos_list,Min100_4_neg_list,SubSpace4_dict)
FourRules4=[]
for i in range(0,100):
    temp_list=rulesEmbbeding(Max100_4_pos_list[i][0],Max100_4_pos_list[i][1])
    FourRules4.append(temp_list)
for i in range(0,100):
    temp_list=rulesEmbbeding(Min100_4_neg_list[i][0],Min100_4_neg_list[i][1])  # both ids come from the negative pair
    FourRules4.append(temp_list)
FourRules4=np.array(FourRules4)
pos_list=[[1,0]]*100
neg_list=[[0,1]]*100
y4=pos_list+neg_list
model4=MyModel(batch_size=128, num_epochs=config.NUM_EPOCHES, word_index=word_index, subId=4,
                 index_pad_array_first=index_pad_array4_first, index_pad_array_second=index_pad_array4_second,FourRules=FourRules4,y=y4)
model4.trainmodel()
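
The five blocks above differ only in the subspace number, so the feature and label preparation could be factored into one helper. The sketch below is one possible shape for it, not the code used in this project: `build_rules_and_labels` is a hypothetical name, and `toy_rules` stands in for `rulesEmbbeding` so that the example runs on its own.

```python
import numpy as np

def build_rules_and_labels(pos_pairs, neg_pairs, rule_fn):
    """Stack rule features for positive then negative pairs and build one-hot labels."""
    rules = [rule_fn(pair[0], pair[1]) for pair in pos_pairs]
    rules += [rule_fn(pair[0], pair[1]) for pair in neg_pairs]
    # [1, 0] marks a positive (similar) pair, [0, 1] a negative one.
    labels = [[1, 0]] * len(pos_pairs) + [[0, 1]] * len(neg_pairs)
    return np.array(rules), labels

def toy_rules(first_id, second_id):
    """Stand-in for rulesEmbbeding: returns four made-up rule values."""
    same = 1.0 if first_id == second_id else 0.0
    return [same, 0.0, 0.0, 0.0]

# Invented ids, for illustration only.
pos_pairs = [(1, 1), (2, 2)]
neg_pairs = [(1, 2), (2, 3)]

rules, y = build_rules_and_labels(pos_pairs, neg_pairs, toy_rules)
print(rules.shape)
print(y)
```

With such a helper, each subspace would need one call with its own positive and negative lists, and the list lengths no longer have to be hard-coded as 100.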

Partial screenshot of the training process

[Image]

Model testing

Code

Here I call the predmodel function written above directly, with a handful of arbitrarily chosen paperIds and subspace 0 selected:

PaperIdList=[102, 114,156,157,164,171,172,173,174,175,177,190,191,192,193,195,205,206,207,230,262,263]
finalresult=predmodel("model/model0.h5",PaperIdList,SubSpace0_dict)

Screenshot of the run results

[Image]

List of similar paper pairs generated in subspace 0

[Image]
