Main research question:
Given a query and a set of documents, return a ranking that orders the documents by how well each one matches the query.
Structure of the paper:
I. Abstract
1. Keyword-based matching often fails at the semantic level; latent semantic models try to bridge this gap.
2. The model is trained on historical click-through data: given a query and a set of documents, it maximizes the conditional likelihood of the clicked documents.
3. A word-hashing technique keeps the model tractable for large-scale web search.
4. Experiments on real-world web ranking data show that DSSM clearly outperforms the other models.
II. Introduction
1. Latent semantic models such as LSA are inherently constrained, so model quality hits a bottleneck.
2. The explosive growth of the web yields enormous click-through logs, and models like LSA face storage and scaling challenges at that volume.
3. Mapping words into a vector space, instead of using one-hot encodings, opens a new direction for model design.
III. Related Work
Mainly reviews latent semantic models applied to click-through data, along with background on deep learning.
IV. Deep Structured Semantic Models
1. Term vector: bag-of-words encoding of the raw text
2. Word hashing: letter n-gram hashing of the terms
3. Multi-layer non-linear projection: a multi-layer perceptron
4. Relevance measured by cosine similarity between the query and document semantic vectors
5. Posterior probability computed by softmax: the probability of each document matching the query (sketched below)
The word-hashing section notes that hash collisions can occur, but at a rate low enough to ignore.
Word hashing gives morphologically similar words similar representations, which makes it robust and greatly alleviates the OOV problem.
Tricks: initialize the network parameters with Xavier (uniform) initialization; optimize with SGD.
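As a sketch of items 4 and 5: relevance is the cosine between the query and document semantic vectors, and the posterior of a document given a query is a softmax over these scores scaled by a smoothing factor gamma (set empirically in the paper); training maximizes the likelihood of the clicked document. A minimal PyTorch sketch, where the vectors and the gamma value are illustrative:
import torch
import torch.nn.functional as F
def posterior(query_vec, doc_vecs, gamma=10.0):
    # R(Q, D): cosine similarity between the query and each candidate document
    sims = F.cosine_similarity(query_vec.unsqueeze(0), doc_vecs, dim=1)
    # P(D|Q) = exp(gamma * R(Q, D)) / sum_D' exp(gamma * R(Q, D')), i.e. a softmax
    return F.softmax(gamma * sims, dim=0)
q = torch.randn(128)        # semantic vector of the query (illustrative)
docs = torch.randn(4, 128)  # one clicked + three sampled unclicked documents
probs = posterior(q, docs)
loss = -torch.log(probs[0]) # maximize the likelihood of the clicked document (index 0)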
V. Experiments
Model 9: no word hashing; Model 10: no fully connected layers; Model 11: no non-linearities; Model 12: the full DSSM.
Dropping word hashing hurts the results the most: going straight from the bag-of-words encoding to a dense vector, without the word-hashing layer as a buffer, loses a lot of information.
VI. Conclusion
Innovations:
1. Use a deep neural network (DNN) to take queries and documents as input
2. Map text into a semantic space with non-linear projections
3. Use word hashing to handle the otherwise huge vocabulary
4. Use cosine similarity to score a query against multiple documents
5. Unlike word2vec, DSSM is trained with supervision
Key points:
1. Use click-through data to design the document-ranking experiments
2. Use non-linear activation functions to extract semantic features
3. Use word hashing so the model can run in large-scale real-world production settings
VII. Code
Dataset: MRPC (Microsoft Research Paraphrase Corpus, also called MSRP), an open sentence-pair semantic-similarity dataset released by Microsoft Research. Each pair of sentences is drawn from coverage of the same news story, and the task is to judge whether the two sentences mean the same thing. The file layout the code expects is sketched below.
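The loading code assumes each split is a tab-separated file with a header row of quality, #1 string, #2 string (the 0/1 label first, then the two sentences). A hypothetical two-row example of that layout (the sentences are made up, not real MRPC data):
quality	#1 string	#2 string
1	He said the food was delicious .	The food was delicious , he said .
0	Shares fell 2 % on Monday .	The company hired a new CEO .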
""" 数据处理部分 """
# encoding = 'utf-8'
vocab = []
file_path = "./MRPC/"
files = ['train_data.csv','test_data.csv']
def n_gram(word,n=3):
s = []
word = "#" + word + "#"
for i in range(len(word)-(n-1)):
s.append(word[i:i+3])
return s
def lst_gram(lst,n=3):
s = []
for word in str(lst).lower().split():
s.extend(n_gram(word))
return s
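A quick check of the two helpers (outputs shown as comments):
print(n_gram('good'))           # ['#go', 'goo', 'ood', 'od#']
print(lst_gram('Good morning')) # ['#go', 'goo', 'ood', 'od#', '#mo', 'mor', 'orn', 'rni', 'nin', 'ing', 'ng#']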
for file in files:
    f = open(file_path + file, encoding='utf-8').readlines()
    for i in range(1, len(f)):  # skip the header row
        # drop the leading label column ("0\t" or "1\t"), keep the two sentences
        s1, s2 = f[i][2:].strip('\n').split('\t')
        vocab.extend(lst_gram(s1))
        vocab.extend(lst_gram(s2))
vocab = set(vocab)
vocab_list = ['[PAD]', '[UNK]']  # index 0 is padding, index 1 marks unseen trigrams
vocab_list.extend(list(vocab))
vocab_file = "save_vocab_mrpc.txt"
with open(vocab_file, 'w', encoding='utf-8') as f:
    for slice in vocab_list:  # write one trigram per line
        f.write(slice)
        f.write('\n')
import numpy as np
import pandas as pd
import torch
def load_vocab():
    # build trigram <-> index lookup tables from the saved vocabulary file
    vocab = open(vocab_file, encoding='utf-8').readlines()
    slice2idx, idx2slice, cnt = {}, {}, 0
    for char in vocab:
        char = char.strip('\n')
        slice2idx[char] = cnt
        idx2slice[cnt] = char
        cnt += 1
    return slice2idx, idx2slice
def padding(text, maxlen=70):
    # truncate or zero-pad every index sequence to exactly maxlen (0 = [PAD])
    pad_text = []
    for sentence in text:
        pad_sentence = np.zeros(maxlen).astype('int64')
        cnt = 0
        for index in sentence:
            pad_sentence[cnt] = index
            cnt += 1
            if cnt == maxlen:
                break
        pad_text.append(pad_sentence.tolist())
    return pad_text
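For example, a single three-index sentence padded to length 5:
print(padding([[5, 6, 7]], maxlen=5))  # [[5, 6, 7, 0, 0]], where 0 is the [PAD] index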
def char_index(text_a, text_b):
    slice2idx, idx2slice = load_vocab()
    a_list, b_list = [], []
    for a_sentence, b_sentence in zip(text_a, text_b):
        a, b = [], []
        for slice in lst_gram(a_sentence):
            if slice in slice2idx:
                a.append(slice2idx[slice])
            else:
                a.append(1)
        for slice in lst_gram(b_sentence):
            if slice in slice2idx:
                b.append(slice2idx[slice])
            else:
                b.append(1)
        a_list.append(a)
        b_list.append(b)
    a_list = padding(a_list)
    b_list = padding(b_list)
    return a_list, b_list
def load_char_data(file_name):
    df = pd.read_csv(file_name, sep='\t')
    text_a = df['#1 string'].values
    text_b = df['#2 string'].values
    label = df['quality'].values
    a_index, b_index = char_index(text_a, text_b)
    return np.array(a_index), np.array(b_index), np.array(label)
a_index, b_index, label = load_char_data('./MRPC/test_data.csv')
""" 模型构建部分 """
import torch
import torch.nn as nn
from torch.utils.data import DataLoader,Dataset
from torch.autograd import Variable
class DSSM(torch.nn.Module):
def __init__(self):
super(DSSM,self).__init__()
self.embedding = nn.Embedding(CHAR_SIZE,embedding_size)
self.linear1 = nn.Linear(embedding_size,256)
self.linear2 = nn.Linear(256,128)
self.linear3 = nn.Linear(128,64)
self.dropout = nn.Dropout(p=0.2)
def forward(self,a,b):
a = self.embedding(a).sum(1)
b = self.embedding(b).sum(1)
a = torch.tanh(self.linear1(a))
a = self.dropout(a)
a = torch.tanh(self.linear2(a))
a = self.dropout(a)
a = torch.tanh(self.linear3(a))
a = self.dropout(a)
b = torch.tanh(self.linear1(b))
b = self.dropout(b)
b = torch.tanh(self.linear2(b))
b = self.dropout(b)
b = torch.tanh(self.linear3(b))
b = self.dropout(b)
cosine = torch.cosine_similarity(a,b,dim=1,eps=1e-8)
return cosine
def _initialize_weights(self):
for m in self.modules():
if isinstance(m,nn.Linear):
torch.nn.init.xavier_uniform_(m.weight)
class MRPCDataset(Dataset):
    def __init__(self, filepath):
        self.path = filepath
        self.a_index, self.b_index, self.label = load_char_data(filepath)
    def __len__(self):
        return len(self.a_index)
    def __getitem__(self, idx):
        return self.a_index[idx], self.b_index[idx], self.label[idx]
""" 模型训练部分 """
CHAR_SIZE=10041
embedding_size=300
EPOCH=50
BATCH_SIZE=50
LR=0.0005
data_root='./MRPC/'
train_path=data_root+'train_data.csv'
test_path=data_root+'test_data.csv'
#1、创建数据集并创立数据载入器
train_data=MRPCDataset(train_path)
test_data=MRPCDataset(test_path)
train_loader=DataLoader(dataset=train_data,batch_size=BATCH_SIZE,shuffle=True)
test_loader=DataLoader(dataset=test_data,batch_size=BATCH_SIZE,shuffle=True)
#2、有gpu用gpu,否则cpu
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dssm=DSSM().to(device)
dssm._initialize_weights()
#3、定义优化方式和损失函数
optimizer=torch.optim.Adam(dssm.parameters(),lr=LR)
loss_func=nn.CrossEntropyLoss()
for epoch in range(EPOCH):
    for step, (text_a, text_b, label) in enumerate(train_loader):
        dssm.train()
        # 1. Move the index tensors to the device; the embedding needs long tensors
        a = text_a.to(device).long()
        b = text_b.to(device).long()
        l = label.to(device).long()
        # 2. Cosine similarity as the score of the "match" class
        pos_res = dssm(a, b)
        neg_res = 1 - pos_res
        # 3. Stack [1-cos, cos] as two-class logits for the cross-entropy loss
        # (the paper instead softmaxes over one clicked document and several
        # sampled unclicked ones; this is a binary adaptation for MRPC)
        out = torch.stack([neg_res, pos_res], 1).to(device)
        loss = loss_func(out, l)
        # 4. The usual backpropagation steps
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (step + 1) % 20 == 0:
            total = 0
            correct = 0
            dssm.eval()  # disable dropout while evaluating
            with torch.no_grad():
                for (test_a, test_b, test_l) in test_loader:
                    tst_a = test_a.to(device).long()
                    tst_b = test_b.to(device).long()
                    tst_l = test_l.to(device).long()
                    pos_res = dssm(tst_a, tst_b)
                    neg_res = 1 - pos_res
                    out = torch.max(torch.stack([neg_res, pos_res], 1), 1)[1]
                    total += tst_l.size(0)
                    correct += (out == tst_l).sum().item()
            print('[Epoch]:', epoch + 1, 'train loss:', loss.item())
            print('[Epoch]:', epoch + 1, 'test accuracy:', correct * 1.0 / total)
torch.save(dssm, './dssm.pkl')
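After training, the saved model can score a new sentence pair. A minimal sketch, where the sentences are illustrative; note that loading a fully pickled model requires the DSSM class definition to be in scope:
dssm = torch.load('./dssm.pkl')
dssm.eval()  # disable dropout for inference
with torch.no_grad():
    a_idx, b_idx = char_index(['A man is playing a guitar .'],
                              ['A man plays the guitar .'])
    a = torch.tensor(a_idx).to(device).long()
    b = torch.tensor(b_idx).to(device).long()
    print(dssm(a, b).item())  # cosine similarity; higher means more likely a paraphrase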