FAQ-style Question Answering System
Final Results
System Architecture
Project Description
A dialogue system, also known as a chatbot, converses with users automatically. A system that helps users complete concrete tasks (placing orders, hailing rides, booking tables, etc.) is called task-oriented; one that answers users' questions (about weather, stock prices, traffic, etc.) is QA-based; beyond these there are chit-chat systems that simply make conversation. Most real dialogue systems mix several of these capabilities.
Language generation in a dialogue system falls into two broad approaches: retrieval-based and generative. In the retrieval-based approach we build a corpus that stores query-response pairs for the FAQ; when a user issues a new query, we retrieve the best-matching response for it. This process is usually split into two stages: recall (retrieve) and ranking. Recall finds the few dozen to few hundred corpus queries most similar to the user query, drastically shrinking the set of candidate responses; here we use lightweight methods such as an inverted index or Approximate Nearest Neighbor Search for fast lookup. Ranking then filters the recalled candidates further: we can build richer features and apply machine learning or deep learning models to order them.
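To make the two-stage structure concrete, here is a minimal sketch of the recall-then-rank flow; recall_top_k and rank_score are hypothetical placeholders for the HNSW recall and LightGBM ranking components built later in this project:

def answer(query, recall_top_k, rank_score, k=100):
    """Sketch of the retrieval pipeline: cheap recall first, expensive reranking second."""
    candidates = recall_top_k(query, k)                    # ANN / inverted-index recall of (question, answer) pairs
    ranked = sorted(candidates,
                    key=lambda qa: rank_score(query, qa[0]),
                    reverse=True)                          # feature-based reranking
    return ranked[0][1] if ranked else None                # answer of the best candidate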
I. Intent Recognition
We treat this as a simple binary text-classification task: from the user's opening utterance we decide whether the intent is a business request or chit-chat. If it is business, the query goes to the retrieval model; if it is chit-chat, it goes to the chat-generation model.
1. Introduction to fastText
fastText is a word-embedding and text-classification tool open-sourced by Facebook in 2016. Academically it is not a major innovation, but its practical advantages are obvious: on text-classification tasks this shallow network often reaches accuracy comparable to deep networks while training orders of magnitude faster. On a standard multi-core CPU it can train word vectors on a billion-word corpus in under ten minutes, and classify half a million sentences over more than 300,000 classes in under a minute.
fastText model architecture
Note: this architecture diagram does not show how the word vectors are trained. Like CBOW, the fastText model has only three layers: an input layer, a hidden layer, and an output layer (hierarchical softmax). The input is a set of words represented as vectors, the output is a single target, and the hidden layer is simply the average of the input word vectors. The differences are: CBOW's input is the context of a target word, while fastText's input is the words of a document plus their n-gram features; CBOW's input words are one-hot encoded, while fastText's input features are embedded; CBOW predicts a target word, while fastText predicts the document's class label.
Notably, fastText also feeds character-level n-gram vectors of each word into the input as extra features.
The most common representation is the bag-of-words model (converting the input into its BoW form), but bag-of-words ignores word order, so fastText additionally uses word n-gram features.
For the sentence "我 爱 她" ("I love her"), the bag-of-words features are "我", "爱", "她" — exactly the same as for the sentence "她 爱 我" ("she loves me").
If we add 2-grams, the first sentence also gets the features "我-爱" and "爱-她", so "我 爱 她" and "她 爱 我" can now be told apart. Of course, to keep things efficient we filter out low-frequency n-grams.
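For illustration, a tiny helper (not part of the project code) that produces the word 2-gram features used in the example above:

def word_ngrams(tokens, n=2):
    """Return word n-gram features such as '我-爱' and '爱-她' for ['我', '爱', '她']."""
    return ['-'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams(['我', '爱', '她']))  # ['我-爱', '爱-她']
print(word_ngrams(['她', '爱', '我']))  # ['她-爱', '爱-我']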
The core idea of fastText is this: average the word and n-gram vectors of the whole document to obtain a document vector, then run a softmax multi-class classifier on that document vector.
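A minimal numerical sketch of that idea (illustrative only; it uses a toy random embedding matrix rather than fastText's real training procedure):

import numpy as np

def fasttext_style_predict(feature_ids, embeddings, W_out):
    """Average the feature (word / n-gram) vectors, then apply a softmax classifier.

    feature_ids: indices of the document's words and n-grams
    embeddings:  (vocab_size, dim) embedding matrix
    W_out:       (dim, num_classes) output weights
    """
    doc_vec = embeddings[feature_ids].mean(axis=0)  # hidden layer: average of the vectors
    logits = doc_vec @ W_out                        # linear output layer
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                      # softmax over classes

# Toy usage with random parameters.
emb = np.random.randn(100, 8)
w_out = np.random.randn(8, 2)
print(fasttext_style_predict([3, 17, 42], emb, w_out))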
2. Data format
train_no_blank.csv
ware.txt
ware.txt is used to generate key_word.txt.
We label each sentence in train_no_blank.csv by checking whether it contains any entry from key_word.txt. If it does, the label is __label__1 (why __label__1 rather than 1? because we use fastText for intent recognition, and fastText expects labels in the form __label__1 or __label__0); otherwise it is __label__0. Here __label__1 means business and __label__0 means chit-chat.
The final training data format is:
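(An illustrative example; the sentences here are made up, but the layout matches what data_process writes below: the cleaned text, a tab, then the __label__ tag.)

你好 请问 这款 手机 什么 时候 发货	__label__1
今天 天气 真 不错	__label__0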
3.Code
#%%
import logging
import sys
import os
import fasttext
import jieba.posseg as pseg
import pandas as pd
from tqdm import tqdm
import config
from config import root_path
from preprocessor import clean, filter_content
# %%
# %%
tqdm.pandas()
# %%
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s",
                    datefmt="%H:%M:%S",
                    level=logging.INFO)
# %%
class Intention(object):
    def __init__(self,
                 data_path=config.train_path,
                 sku_path=config.ware_path,
                 model_path=None,
                 kw_path=None,
                 model_train_file=config.business_train,
                 model_test_file=config.business_test):
        self.model_path = model_path
        self.data = pd.read_csv(data_path, encoding='utf-8')
        if model_path and os.path.exists(model_path):
            self.fast = fasttext.load_model(model_path)
        else:
            self.kw = self.build_keyword(sku_path, to_file=kw_path)
            self.data_process(model_train_file)  # Create the labelled training file
            self.fast = self.train(model_train_file, model_test_file)

    def build_keyword(self, sku_path, to_file):
        """Collect business keywords from the dialogue data and the SKU (ware) file."""
        logging.info('Building keywords.')
        tokens = self.data['custom'].dropna().apply(
            lambda x: [
                token for token, pos in pseg.cut(x) if pos in ['n', 'vn', 'nz']
            ])
        key_words = set(
            [tk for idx, sample in tokens.iteritems()
             for tk in sample if len(tk) > 1])
        logging.info('Key words built.')
        sku = []
        with open(sku_path, 'r', encoding='utf-8') as f:
            next(f)
            for lines in f:
                line = lines.strip().split('\t')
                sku.extend(line[-1].split('/'))
        key_words |= set(sku)
        logging.info('Sku words merged.')
        if to_file is not None:
            with open(to_file, 'w', encoding='utf-8') as f:
                for i in key_words:
                    f.write(i + '\n')
        return key_words

    def data_process(self, model_data_file):
        """Label each utterance (1 = business, 0 = chit-chat) and write the fastText training file."""
        logging.info('Processing data.')
        self.data['is_business'] = self.data['custom'].progress_apply(
            lambda x: 1 if any(kw in x for kw in self.kw) else 0)
        with open(model_data_file, 'w', encoding='utf-8') as f:
            for index, row in tqdm(self.data.iterrows(),
                                   total=self.data.shape[0]):
                outline = clean(row['custom']) + "\t__label__" + str(
                    int(row['is_business'])) + "\n"
                f.write(outline)

    def train(self, model_data_file, model_test_file):
        logging.info('Training classifier.')
        classifier = fasttext.train_supervised(model_data_file,
                                               label="__label__",
                                               dim=100,
                                               epoch=5,
                                               lr=0.1,
                                               wordNgrams=2,
                                               loss='softmax',
                                               thread=5,
                                               verbose=True)
        self.test(classifier, model_test_file)
        classifier.save_model(self.model_path)
        logging.info('Model saved.')
        return classifier

    def test(self, classifier, model_test_file):
        """Build the labelled test file and report the F1 score."""
        logging.info('Testing trained model.')
        test = pd.read_csv(config.test_path).fillna('')
        test['is_business'] = test['custom'].progress_apply(
            lambda x: 1 if any(kw in x for kw in self.kw) else 0)
        with open(model_test_file, 'w', encoding='utf-8') as f:
            for index, row in tqdm(test.iterrows(), total=test.shape[0]):
                outline = clean(row['custom']) + "\t__label__" + str(
                    int(row['is_business'])) + "\n"
                f.write(outline)
        result = classifier.test(model_test_file)
        # result = (sample count, precision, recall); print the F1 score
        print(result[1] * result[2] * 2 / (result[2] + result[1]))

    def predict(self, text):
        logging.info('Predicting.')
        label, score = self.fast.predict(clean(filter_content(text)))
        return label, score
#%%
if __name__ == "__main__":
    it = Intention(config.train_path,
                   config.ware_path,
                   model_path=config.ft_path,
                   kw_path=config.keyword_path)
    print(it.predict('你最近怎么样'))
    print(it.predict('你好手机多少钱'))
Output:
II. Retrieval Model
We use Hierarchical Navigable Small World (HNSW), one of the most widely used Approximate Nearest Neighbor Search (ANNS) methods, for the recall stage; we then build a variety of similarity features (including a deep matching network) and use LightGBM to train a Learning-to-Rank model.
Readers who want to understand how HNSW works can read this article: HNSW原理.
1. Preprocessing
Because HNSW cannot index raw text directly, we first convert each sentence into a vector. Here the sentence vector is the average of the word vectors of the words in the sentence, computed as follows:
def wam(sentence, w2v_model):
    """Word Average Model: represent a sentence as the mean of its word vectors."""
    arr = []
    for s in clean(sentence).split():
        if s not in w2v_model.wv.vocab.keys():
            # If the word is not in the word2vec vocabulary, fall back to a random 300-d vector
            arr.append(np.random.randn(1, 300))
        else:
            # Otherwise take the word's vector from the word2vec model
            arr.append(w2v_model.wv.get_vector(s))
    return np.mean(np.array(arr), axis=0).reshape(1, -1)
2. Building the HNSW index
class HNSW(object):
    def __init__(self,
                 w2v_path,
                 data_path=None,
                 ef=config.ef_construction,
                 M=config.M,
                 model_path=config.hnsw_path):
        self.w2v_model = KeyedVectors.load(w2v_path)
        self.data = self.data_load(data_path)
        if model_path and os.path.exists(model_path):
            # Load an existing index
            self.hnsw = self.load_hnsw(model_path)
        else:
            # Build a new index
            self.hnsw = \
                self.build_hnsw(os.path.join(config.root_path, 'model/retrieval/hnsw.bin'),
                                ef=ef,
                                m=M)

    def data_load(self, data_path):
        """Load the dialogue data and attach a sentence vector to every query."""
        data = pd.read_csv(data_path)
        data['custom_vec'] = data['custom'].apply(
            lambda x: wam(x, self.w2v_model))
        data['custom_vec'] = data['custom_vec'].apply(
            lambda x: x[0][0] if x.shape[1] != 300 else x)
        data = data.dropna()
        return data

    def build_hnsw(self, to_file, ef=2000, m=64):
        """Build the HNSW index over all sentence vectors and save it to disk."""
        logging.info('build_hnsw')
        dim = self.w2v_model.vector_size
        num_elements = self.data['custom'].shape[0]
        hnsw = np.stack(self.data['custom_vec'].values).reshape(-1, 300)
        p = hnswlib.Index(space='l2',
                          dim=dim)  # possible options are l2, cosine or ip
        p.init_index(max_elements=num_elements, ef_construction=ef, M=m)
        p.set_ef(10)
        p.set_num_threads(8)
        p.add_items(hnsw)
        logging.info('Start')
        # Sanity check: every vector should retrieve itself as its own nearest neighbour
        labels, distances = p.knn_query(hnsw, k=1)
        print('labels: ', labels)
        print('distances: ', distances)
        logging.info("Recall:{}".format(
            np.mean(labels.reshape(-1) == np.arange(len(hnsw)))))
        p.save_index(to_file)
        return p

    def load_hnsw(self, model_path):
        hnsw = hnswlib.Index(space='l2', dim=self.w2v_model.vector_size)
        hnsw.load_index(model_path)
        return hnsw

    def search(self, text, k=5):
        """Return the k most similar corpus queries (and their answers) for the input text."""
        test_vec = wam(clean(text), self.w2v_model)
        q_labels, q_distances = self.hnsw.knn_query(test_vec, k=k)
        return pd.concat(
            (self.data.iloc[q_labels[0]]['custom'].reset_index(),
             self.data.iloc[q_labels[0]]['assistance'].reset_index(drop=True),
             pd.DataFrame(q_distances.reshape(-1, 1), columns=['q_distance'])),
            axis=1)


if __name__ == "__main__":
    hnsw = HNSW(config.w2v_path,
                config.train_path,
                config.ef_construction,
                config.M,
                config.hnsw_path)
    test = '在手机上下载'
    result = hnsw.search(test, k=10)
An important parameter is `space`, which selects the distance metric used by the index; hnswlib supports l2, cosine and ip (inner product).
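For example, switching the index to cosine distance only changes the space argument (a small sketch using the same hnswlib API as above):

import hnswlib

p = hnswlib.Index(space='cosine', dim=300)   # cosine instead of squared L2
p.init_index(max_elements=100000, ef_construction=2000, M=64)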
At this point the recall model is complete. Next, the candidates returned by the recall model are used as input to a learning-to-rank model, which produces the final result.
3. Learning To Rank
In this step we build several kinds of similarity features, which fall roughly into the following groups:
- string-distance based (edit / Levenshtein distance, LCS);
- vector-distance based (cosine, Euclidean, Jaccard, WMD);
- statistics based (BM25, Pearson correlation);
- deep matching model based.
Once the features are built, we use LightGBM to train a Learning-to-Rank model.
# LCS: longest common subsequence, normalised to a similarity ratio
def lcs(self, str_a, str_b):
    lengths = [[0 for j in range(len(str_b) + 1)]
               for i in range(len(str_a) + 1)]
    for i, x in enumerate(str_a):
        for j, y in enumerate(str_b):
            if x == y:
                lengths[i + 1][j + 1] = lengths[i][j] + 1
            else:
                lengths[i + 1][j + 1] = max(lengths[i + 1][j], lengths[i][j + 1])
    result = ""
    x, y = len(str_a), len(str_b)
    while x != 0 and y != 0:
        if lengths[x][y] == lengths[x - 1][y]:
            x -= 1
        elif lengths[x][y] == lengths[x][y - 1]:
            y -= 1
        else:
            assert str_a[x - 1] == str_b[y - 1]
            result = str_a[x - 1] + result
            x -= 1
            y -= 1
    longestdist = lengths[len(str_a)][len(str_b)]
    ratio = longestdist / min(len(str_a), len(str_b))
    return ratio


# Edit (Levenshtein) distance, normalised to a similarity ratio
def editDistance(self, str1, str2):
    m = len(str1)
    n = len(str2)
    lensum = float(m + n)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if str1[i - 1] == str2[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]) + 1
    dist = d[-1][-1]
    ratio = (lensum - dist) / lensum
    return ratio


# Jaccard similarity over the token sets of the two sentences
def JaccardSim(self, str_a, str_b):
    seta = self.tokenize(str_a)[1]
    setb = self.tokenize(str_b)[1]
    sa_sb = 1.0 * len(seta & setb) / len(seta | setb)
    return sa_sb


# Cosine similarity between two vectors
def cos_sim(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.sum(a * b) / (np.sqrt(np.sum(a**2)) * np.sqrt(np.sum(b**2)))


# Similarity derived from the Euclidean distance between two vectors
def eucl_sim(a, b):
    a = np.array(a)
    b = np.array(b)
    return 1 / (1 + np.sqrt(np.sum((a - b)**2)))


# Pearson correlation between two vectors
def pearson_sim(a, b):
    a = np.array(a)
    b = np.array(b)
    a = a - np.average(a)
    b = b - np.average(b)
    return np.sum(a * b) / (np.sqrt(np.sum(a**2)) * np.sqrt(np.sum(b**2)))
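These functions are later combined by the project's TextSimilarity.generate_all (whose exact implementation is not shown in this write-up) into one feature dictionary per question pair. A hedged sketch of what such a combiner might look like, assuming the sentence vectors come from wam above:

def generate_features(q1, q2, w2v_model, ts):
    """Illustrative only: bundle several similarity scores into one feature dict.

    `ts` is assumed to expose the string-based methods above (lcs, editDistance,
    JaccardSim); the vector-based scores use the wam sentence vectors.
    """
    v1 = wam(q1, w2v_model).reshape(-1)
    v2 = wam(q2, w2v_model).reshape(-1)
    return {
        'lcs': ts.lcs(q1, q2),
        'edit_dist': ts.editDistance(q1, q2),
        'jaccard': ts.JaccardSim(q1, q2),
        'w2v_cos': cos_sim(v1, v2),
        'w2v_eucl': eucl_sim(v1, v2),
        'w2v_pearson': pearson_sim(v1, v2),
    }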
(1) BM25
A brief introduction to BM25:
BM25 is an algorithm for scoring the relevance between a search query and a document, derived from a probabilistic retrieval model. In plain words: given a query and a set of documents Ds, we want a relevance score between the query and each document D. We first segment the query into terms q_i, and the score contributed by each term has three parts:
- the relevance between the term q_i and the document D;
- the relevance between the term q_i and the query;
- the weight of the term itself.
Summing the per-term scores gives the final score between the query and the document.
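Written out to match the implementation below, where f_i is the frequency of term q_i in the document d, qf_i its frequency in the query, avgdl the average document length, and k1, k2, b are free parameters:

$$\mathrm{score}(q,d)=\sum_{i}\mathrm{IDF}(q_i)\cdot\frac{f_i\,(k_1+1)}{f_i+K}\cdot\frac{qf_i\,(k_2+1)}{qf_i+k_2},\qquad K=k_1\Bigl(1-b+b\cdot\frac{|d|}{\mathrm{avgdl}}\Bigr)$$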
#%%
import math
import sys
from collections import Counter
import os
import csv
# %%
import jieba
import jieba.posseg as pseg
import numpy as np
import pandas as pd
import joblib
from config import root_path
# %%
class BM25(object):
    def __init__(self, do_train=True, save_path=os.path.join(root_path, 'model/ranking/')):
        self.stopwords = self.load_stop_word()
        if do_train:
            self.data = pd.read_csv(os.path.join(root_path, 'data/ranking/train.tsv'), sep='\t', header=None,
                                    quoting=csv.QUOTE_NONE, names=['question1', 'question2', 'target'])
            self.idf, self.avgdl = self.get_idf()
            self.saver(save_path)
        else:
            self.load(save_path)

    def load_stop_word(self):
        stop_words = os.path.join(root_path, 'data/stopwords.txt')
        stopwords = open(stop_words, 'r', encoding='utf-8').readlines()
        stopwords = [w.strip() for w in stopwords]
        return stopwords

    def tf(self, word, count):
        return count[word] / sum(count.values())

    def n_containing(self, word, count_list):
        # Number of documents that contain the word
        return sum(1 for count in count_list if word in count)

    def cal_idf(self, word, count_list):
        return math.log(len(count_list)) / (1 + self.n_containing(word, count_list))

    def get_idf(self):
        self.data['question2'] = self.data['question2'].apply(lambda x: " ".join(jieba.cut(x)))
        idf = Counter([y for x in self.data['question2'].tolist() for y in x.split()])
        idf = {k: self.cal_idf(k, self.data['question2'].tolist()) for k, v in idf.items()}
        avgdl = np.array([len(x.split()) for x in self.data['question2'].tolist()]).mean()
        return idf, avgdl

    def saver(self, save_path):
        joblib.dump(self.idf, save_path + 'bm25_idf.bin')
        joblib.dump(self.avgdl, save_path + 'bm25_avgdl.bin')

    def load(self, save_path):
        self.idf = joblib.load(save_path + 'bm25_idf.bin')
        self.avgdl = joblib.load(save_path + 'bm25_avgdl.bin')

    def bm_25(self, q, d, k1=1.2, k2=200, b=0.75):
        stop_flag = ['x', 'c', 'u', 'd', 'p', 't', 'uj', 'm', 'f', 'r']
        words = pseg.cut(q)  # segment the query
        fi = {}
        qfi = {}
        for word, flag in words:
            if flag not in stop_flag and word not in self.stopwords:
                fi[word] = d.count(word)
                qfi[word] = q.count(word)
        K = k1 * (1 - b + b * (len(d) / self.avgdl))  # compute K
        ri = {}
        for key in fi:
            ri[key] = fi[key] * (k1 + 1) * qfi[key] * (k2 + 1) / ((fi[key] + K) * (qfi[key] + k2))  # compute R
        score = 0
        for key in ri:
            score += self.idf.get(key, 20.0) * ri[key]
        return score
#%%
if __name__ == '__main__':
    bm25 = BM25(do_train=True)
# %%
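After training, the saved idf/avgdl statistics can be reloaded and used to score a query against a candidate document, for example (illustrative call using the class above, with made-up sentences):

bm25 = BM25(do_train=False)          # loads bm25_idf.bin / bm25_avgdl.bin
score = bm25.bm_25('手机 多少 钱', '这款 手机 的 价格 是 多少')
print(score)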
(2) Deep matching
This is essentially a binary classification task: label 1 means the two sentences are similar, label 0 means they are not.
Here we use BERT directly.
The data format looks like:
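(An illustrative example with made-up sentences: tab-separated question1, question2 and target, matching how data/ranking/train.tsv is read elsewhere in this project.)

这款手机支持快充吗	这个手机可以快速充电吗	1
这款手机支持快充吗	今天天气怎么样	0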
Model:
import os
import torch
import torch.nn as nn
from transformers import BertForSequenceClassification
from config import root_path

is_cuda = torch.cuda.is_available()  # assumption: the original project defines is_cuda elsewhere


class BertModelTrain(nn.Module):
    def __init__(self):
        super(BertModelTrain, self).__init__()
        # Pretrained BERT with a 2-class classification head
        self.bert = BertForSequenceClassification.from_pretrained(
            os.path.join(root_path, 'lib/bert/'), num_labels=2)
        self.device = torch.device("cuda") if is_cuda else torch.device("cpu")
        for param in self.bert.parameters():
            param.requires_grad = True  # fine-tune all BERT parameters

    def forward(self, batch_seqs, batch_seq_masks, batch_seq_segments, labels):
        outputs = self.bert(input_ids=batch_seqs,
                            attention_mask=batch_seq_masks,
                            token_type_ids=batch_seq_segments,
                            labels=labels)
        loss = outputs[0]
        logits = outputs[1]
        probabilities = nn.functional.softmax(logits, dim=-1)
        return loss, logits, probabilities
The output is the similarity score between the two sentences.
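In the ranker below this score is obtained through MatchingNN.predict, whose implementation is not shown in this write-up. A minimal, hypothetical sketch of what such a predict method might look like, assuming a matching BERT tokenizer and a trained BertModelTrain-style model:

import torch

# Hypothetical sketch (the project's real MatchingNN may differ): encode the two
# sentences as one BERT sequence pair and return the predicted label and the
# probability of the "similar" class (index 1), which is used as matching_score.
class MatchingNN:
    def __init__(self, model, tokenizer, device='cpu', max_len=128):
        self.model = model.to(device).eval()  # trained BertModelTrain-style model
        self.tokenizer = tokenizer            # BERT tokenizer for the same checkpoint
        self.device = device
        self.max_len = max_len

    def predict(self, q1, q2):
        enc = self.tokenizer(q1, q2,
                             truncation=True,
                             max_length=self.max_len,
                             padding='max_length',
                             return_tensors='pt').to(self.device)
        with torch.no_grad():
            out = self.model.bert(input_ids=enc['input_ids'],
                                  attention_mask=enc['attention_mask'],
                                  token_type_ids=enc['token_type_ids'])
        probs = torch.softmax(out.logits, dim=-1)[0]
        return int(probs.argmax()), float(probs[1])  # (predicted label, similarity score)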
The hand-crafted similarity features above, together with the deep-matching score, are fed into LightGBM for training, which produces the fine-ranked results.
#%%
import sys
import os
import csv
import logging
import lightgbm as lgb
import pandas as pd
import joblib
from tqdm import tqdm
from config import root_path
from matchnn import MatchingNN
from similarity import TextSimilarity
from hnsw_faiss import wam
from sklearn.model_selection import train_test_split
import numpy as np
# %%
tqdm.pandas()
# %%
params = {'boosting_type': 'gbdt',
          'max_depth': 5,
          'objective': 'binary',
          'nthread': 3,
          'num_leaves': 64,
          'learning_rate': 0.05,
          'max_bin': 512,
          'subsample_for_bin': 200,
          'subsample': 0.5,
          'subsample_freq': 5,
          'colsample_bytree': 0.8,
          'reg_alpha': 5,
          'reg_lambda': 10,
          'min_split_gain': 0.5,
          'min_child_weight': 1,
          'min_child_samples': 5,
          'scale_pos_weight': 1,
          'max_position': 20,
          'group': 'name:groupId',
          'metric': 'auc'}
# %%
class RANK(object):
    def __init__(self, do_train=True, model_path=os.path.join(root_path, 'model/ranking/lightgbm')):
        self.ts = TextSimilarity()
        self.matchingNN = MatchingNN()
        if do_train:
            logging.info('Training mode')
            self.train = pd.read_csv(
                os.path.join(root_path, 'data/ranking/train.tsv'),
                delimiter="\t",
                encoding="utf-8")
            self.data = self.generate_feature(self.train)
            self.columns = [i for i in self.train.columns if 'question' not in i]
            self.trainer()
            self.save(model_path)
        else:
            logging.info('Predicting mode')
            self.test = pd.read_csv(
                os.path.join(root_path, 'data/ranking/test.tsv'),
                delimiter="\t",
                encoding="utf-8")
            # self.testdata = self.generate_feature(self.test)
            self.gbm = joblib.load(model_path)
            # self.predict(self.testdata)

    def generate_feature(self, data):
        """Attach the hand-crafted similarity features and the deep-matching score."""
        logging.info('Generating manual features.')
        data = pd.concat([data, pd.DataFrame.from_records(
            data.apply(lambda row: self.ts.generate_all(row['question1'], row['question2']), axis=1))], axis=1)
        logging.info('Generating deep-matching features.')
        data['matching_score'] = data.apply(
            lambda row: self.matchingNN.predict(row['question1'], row['question2'])[1], axis=1)
        return data

    def trainer(self):
        logging.info('Training lightgbm model.')
        self.gbm = lgb.LGBMRanker(**params)
        columns = [i for i in self.data.columns if i not in ['question1', 'question2', 'target']]
        X_train, X_test, y_train, y_test = train_test_split(
            self.data[columns], self.data['target'], test_size=0.3, random_state=42)
        query_train = [X_train.shape[0]]
        query_val = [X_test.shape[0]]
        self.gbm.fit(X_train, y_train,
                     group=query_train,
                     eval_set=[(X_test, y_test)],
                     eval_group=[query_val],
                     eval_at=[5, 10, 20],
                     early_stopping_rounds=50)

    def save(self, model_path):
        logging.info('Saving lightgbm model.')
        joblib.dump(self.gbm, model_path)

    def predict(self, data: pd.DataFrame):
        columns = [i for i in data.columns if i not in ['question1', 'question2', 'target']]
        result = self.gbm.predict(data[columns])
        return result


if __name__ == '__main__':
    rank = RANK(do_train=False)
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 26 10:03:50 2021
@author: Sean
"""
#%%
import os
from business import Intention
from hnsw_faiss import HNSW
from ranker import RANK
import config
import pandas as pd
#%%
# Intent classifier: business vs. chit-chat
it = Intention(config.train_path,
               config.ware_path,
               model_path=config.ft_path,
               kw_path=config.keyword_path)
# HNSW recall model
hnsw = HNSW(config.w2v_path,
            config.train_path,
            config.ef_construction,
            config.M,
            config.hnsw_path)
#%%
import joblib
import ranker
model_path= os.path.join(config.root_path, 'model/ranking/lightgbm')
gbm = joblib.load(model_path)
#%%
query = '请问这电脑厚度是多少'
label,score = it.predict(query)
res = pd.DataFrame()
if len(query) > 1 and '__label__1' in label:
    # For a business query, recall the 5 most similar questions and their answers
    recalled = hnsw.search(query, 5)
    res = res.append(pd.DataFrame({'query': [query] * 5,
                                   'retrieved': recalled['custom'],
                                   'retr_assistance': recalled['assistance']}))
#%%
ranked = pd.DataFrame()
#%%
ranked['question1'] = res['query']
ranked['question2'] = res['retrieved']
ranked['answer'] = res['retr_assistance']
#%%
from similarity import TextSimilarity
ts = TextSimilarity()
data = ranked
data = pd.concat([data, pd.DataFrame.from_records(data.apply(lambda row: ts.generate_all(row['question1'] , row['question2']), axis=1))], axis=1)
#%%
from matchnn import MatchingNN
matchingNN = MatchingNN()
data['matching_score'] = data.apply(lambda row: matchingNN.predict(row['question1'] , row['question2'])[1] , axis=1)
data.to_csv('result/qa_result.csv', index=False)
#%%
'''
The code above runs on the server; qa_result.csv is then copied off it.
'''
#%%
'''
Fine-ranking results.
The ranker combines multiple similarity measures:
lcs, edit_dist, jaccard, bm25, w2v_cos, w2v_eucl, w2v_pearson, w2v_wmd, fast_cos, fast_eucl, fast_pearson, fast_wmd, tfidf_cos, tfidf_eucl, tfidf_pearson
'''
import pandas as pd
import ranker
qa_result = pd.read_csv('result/qa_result (3).csv')
columns = [i for i in qa_result.columns if i not in ['question1' , 'question2' , 'target', 'answer']]
rank_scores = gbm.predict(qa_result[columns])
qa_result['rank_score'] = rank_scores
qa_result.to_csv('result/result.csv', index=False)
#%%
result = qa_result['rank_score'].sort_values(ascending=False)
#%%
print(qa_result['answer'].iloc[result.index[0]])
III. Summary
At the moment only the business (FAQ) question-answering part is finished; the chit-chat part is not yet fully complete.
All of the code above has been uploaded to GitHub:
FAQ-question-answer-system