Paper Comparison System Based on Rule Embedding (6): Data Processing 3

qq_43665502

Published 2020-06-15 10:20:24

Article link: https://blog.csdn.net/qq_43665502/article/details/106756855

Data Processing 3

Environment
Approach
Text serialization
Summary

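Environment

Python 3, Keras 2.0.6

Approach

The previous posts built a training set in each of the 5 subspaces. The next step is to serialize the text, that is, to turn each piece of text into a sequence of word indices.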

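Text serialization

Generating word vectors with word2vec

Model overview

CBOW (predicts the current word from its context)
skip-gram (predicts the context from the current word)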
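To make the difference between the two models concrete, here is a minimal sketch of the training examples each one would see for a single sentence. The function names, the toy sentence, and the fixed window are illustrative assumptions; real word2vec implementations add further details (such as sampling tricks) that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_words, center) examples: the neighbors predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


# Toy sentence, made up for illustration only.
tokens = ["paper", "comparison", "with", "rule", "embedding"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
print(sg[:3])
print(cb[:2])
```

With skip-gram every (center, neighbor) combination becomes its own training pair, while CBOW produces one example per position with all neighbors grouped together.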

Hyperparameters

Model: skip-gram
Word-vector length: 256
Number of iterations: 8
Window size: the default, 5

Code

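import logging
import gensim
from gensim.models import word2vec
# Configure log output
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Read the txt file directly with the API gensim provides; the available readers include LineSentence, Text8Corpus, PathLineSentences, etc.
sentences = word2vec.LineSentence("ACM_dataset/abstracts.txt")
# Train the model: word-vector length 256, 8 iterations, skip-gram (sg=1), window left at its default of 5
# Note: in gensim >= 4.0 the parameters size and iter are named vector_size and epochs
model = gensim.models.Word2Vec(sentences, size=256, sg=1, iter=8)  
# Save the vectors in word2vec text format (the commented-out line saves the binary format instead)
#model.wv.save_word2vec_format("model/word2Vec" + ".bin", binary=True) 
model.wv.save_word2vec_format('model/word2vec.txt',binary = False)
# Load a model saved in binary format
#wordVec = gensim.models.KeyedVectors.load_word2vec_format("model/word2Vec.bin", binary=True)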

[Figure: screenshot of the training run]

[Figure: partial screenshot of the generated word2vec.txt file]

Note: the two numbers on the first line are the number of words and the word-vector length.
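Since the screenshot is not reproduced here, the layout of the text format can be illustrated with a small self-contained sketch. The three-word file below is made up for the example (its file name and values are assumptions, not data from the project); it only demonstrates the header line followed by one word and its vector per line.

```python
import os
import tempfile

# Tiny made-up file in the word2vec text format: 3 words, 4 dimensions.
sample = (
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    "paper -0.5 0.0 0.5 1.0\n"
    "rule 0.9 0.8 0.7 0.6\n"
)

path = os.path.join(tempfile.mkdtemp(), "toy_word2vec.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

with open(path, "r", encoding="utf-8") as f:
    # Header line: "<number of words> <vector length>"
    vocab_size, dim = (int(x) for x in f.readline().split())
    vectors = {}
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]

print(vocab_size, dim)
print(vectors["paper"])
```

Reading the header first gives the vocabulary size and dimension up front, which is what the embedding-matrix code needs in order to allocate the matrix.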

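Building the word-vector matrix

Brief explanation

The embedding matrix built here is the one used later to initialize the embedding layer when the model is constructed.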
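As a rough illustration of why the matrix has one extra row and how it is used, the sketch below mimics what an embedding layer does with such a matrix: it looks up one row per word index. The toy matrix and sequence are assumptions for the example; row 0 stands for the padding index and is left as zeros.

```python
import numpy as np

# Toy embedding matrix: 4 rows (index 0 = padding, 1..3 = words), 3 dimensions.
embedding_matrix = np.array([
    [0.0, 0.0, 0.0],   # row 0: padding index, left as zeros
    [0.1, 0.2, 0.3],   # word with index 1
    [0.4, 0.5, 0.6],   # word with index 2
    [0.7, 0.8, 0.9],   # word with index 3
])

# A padded index sequence: two padding positions followed by three words.
sequence = np.array([0, 0, 2, 3, 1])

# The lookup an embedding layer performs: one matrix row per index.
embedded = embedding_matrix[sequence]
print(embedded.shape)
```

Because word indices start at 1, the matrix needs (number of words + 1) rows so that index 0 can be reserved for padding.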

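Code

import numpy as np
import config  # assumed to be the project's own settings module; WORD_EMBEDDING_DIR is the path to word2vec.txt

EMBEDDING_DIM = 256

print('word embedding')
embeddings_index = {}  # word -> vector
word_index = {}        # word -> row index, starting at 1 (0 is reserved for padding)
embedding_max_value = 0
embedding_min_value = 1
idx = 1
with open(config.WORD_EMBEDDING_DIR, 'r', encoding='utf-8') as f:
    # The first line is the header: "<number of words> <vector length>"
    word_num = int(f.readline().split()[0])
    for line in f:
        line = line.strip().split(' ')
        if len(line) != EMBEDDING_DIM + 1:
            # Malformed line: report it and skip it
            print("error!")
            continue
        coefs = np.asarray(line[1:], dtype='float32')
        # Track the value range of all vectors
        if np.max(coefs) > embedding_max_value:
            embedding_max_value = np.max(coefs)
        if np.min(coefs) < embedding_min_value:
            embedding_min_value = np.min(coefs)
        embeddings_index[line[0]] = coefs
        word_index[line[0]] = idx
        idx = idx + 1
# One extra row, because row 0 is kept as an all-zero padding vector
embedword_matrix = np.zeros((word_num + 1, EMBEDDING_DIM))
for word, idx in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedword_matrix[idx] = embedding_vector
    else:
        # Fallback for a word without a vector: random values within the observed range
        # (not reached here, because both dicts are filled with the same keys)
        embedword_matrix[idx] = np.random.uniform(low=embedding_min_value, high=embedding_max_value,
                                                  size=EMBEDDING_DIM)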

Result

The shape of the matrix is (29569, 256).
[Figure: screenshot of the resulting matrix]

sentence2sequence

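Brief explanation

This step prepares the input for the model built later. The model takes the sequence representation of a paper pair, which consists of a semantic part and a rule part; this section mainly deals with the semantic part.

Code

Only the processing for subspace 0 is shown here. Later, this code will be wrapped into functions so that it is easy to call.

Semantic part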

import numpy as np
from keras.preprocessing.sequence import pad_sequences

# Serialize the sentences of the 100 positive samples of subspace 0 (using word_index)
# First paper of each positive pair
list0_index_pos_first=[]    
for each in Max100_0_pos_list:
    temp_str=SubSpace0_dict[each[0]]
    temp_list_word=temp_str.split(" ")
    temp_list_index=[]
    for word in temp_list_word:
        # Words that are not in the vocabulary are dropped
        if word in word_index:
            temp_list_index.append(word_index[word])
    list0_index_pos_first.append(temp_list_index)
pos_index_pad_array0_first = pad_sequences(list0_index_pos_first, maxlen=150)

# Second paper of each positive pair
list0_index_pos_sec=[] 
for each in Max100_0_pos_list:
    temp_str=SubSpace0_dict[each[1]]
    temp_list_word=temp_str.split(" ")
    temp_list_index=[]
    for word in temp_list_word:
        if word in word_index:
            temp_list_index.append(word_index[word])
    list0_index_pos_sec.append(temp_list_index)
pos_index_pad_array0_second = pad_sequences(list0_index_pos_sec, maxlen=150)

# Serialize the sentences of the 100 negative samples of subspace 0 (using word_index)
# First paper of each negative pair
list0_index_neg_first=[]  
for each in Min100_0_neg_list:
    temp_str=SubSpace0_dict[each[0]] 
    temp_list_word=temp_str.split(" ")
    temp_list_index=[]
    for word in temp_list_word:
        if word in word_index:
            temp_list_index.append(word_index[word])
    list0_index_neg_first.append(temp_list_index)
neg_index_pad_array0_first = pad_sequences(list0_index_neg_first, maxlen=150)
# Second paper of each negative pair
list0_index_neg_sec=[]  
for each in Min100_0_neg_list:
    temp_str=SubSpace0_dict[each[1]]
    temp_list_word=temp_str.split(" ")
    temp_list_index=[]
    for word in temp_list_word:
        if word in word_index:
            temp_list_index.append(word_index[word])
    list0_index_neg_sec.append(temp_list_index)
neg_index_pad_array0_second = pad_sequences(list0_index_neg_sec, maxlen=150) 
# Stack positive samples first, then negative samples
index_pad_array0_first=np.concatenate((pos_index_pad_array0_first,neg_index_pad_array0_first),axis=0)
index_pad_array0_second=np.concatenate((pos_index_pad_array0_second,neg_index_pad_array0_second),axis=0)
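The four loops above differ only in the sample list and the position inside the pair, so they can be folded into a few helpers. The following is one possible sketch, not the final project code: the function names and toy data are hypothetical, and `pad_pre` is a plain-Python stand-in that imitates the default behaviour of Keras `pad_sequences` (zeros added at the front, overly long sequences cut from the front).

```python
def sentence_to_sequence(sentence, word_index):
    """Map a space-separated sentence to word indices, dropping unknown words."""
    return [word_index[w] for w in sentence.split(" ") if w in word_index]


def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


def pairs_to_sequences(pairs, texts, word_index, maxlen):
    """Return two lists of padded sequences: first and second paper of each pair."""
    first = [pad_pre(sentence_to_sequence(texts[a], word_index), maxlen) for a, _ in pairs]
    second = [pad_pre(sentence_to_sequence(texts[b], word_index), maxlen) for _, b in pairs]
    return first, second


# Toy data, made up for illustration only.
word_index = {"rule": 1, "embedding": 2, "paper": 3}
texts = {
    "p1": "rule embedding for paper",
    "p2": "paper paper rule",
    "p3": "unknown words only",
}
pairs = [("p1", "p2"), ("p2", "p3")]

first, second = pairs_to_sequences(pairs, texts, word_index, maxlen=4)
print(first)
print(second)
```

With helpers like these, each subspace would need only one call for its positive pairs and one for its negative pairs, followed by the same concatenation as above.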

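Rules
Note: other team members are still working on the processing for the three rules we defined, so for now I just fill in placeholder numbers here. This part will be revised later.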

threeRules=[[2]*3]*200
threeRules=np.array(threeRules)

True labels
Note: the first 100 samples are positive and the last 100 are negative.

pos_list=[[1,0]]*100
neg_list=[[0,1]]*100
y=pos_list+neg_list
y_train= np.asarray(y).astype('float32')

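Visualizing the results

index_pad_array0_first (the numbers at the end are the indices of the words in the sentence; the front is padded with 0)
Its shape is (200, 150).
In other words, the semantic part of each paper in a given subspace is represented by a sequence of length 150.
Shown below is the sequence representation of the first paper of every paper pair in the training set.
index_pad_array0_second (the only difference from the figure above is that it holds the sequence representation of the second paper of every paper pair)
These two arrays are the main part of the model input later on.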