FAQ-Style Question Answering System

Final Result

(screenshot of the final result omitted)

System Architecture

(system architecture diagram omitted)

Project Description

A dialogue system, also called a chatbot, automatically converses with users. Systems that help users complete concrete tasks (placing orders, hailing a ride, booking a table, etc.) are called task-oriented dialogue systems; systems that answer users' questions (weather, stock prices, traffic, etc.) are called QA-based dialogue systems; beyond these there are also chit-chat systems that simply make small talk with the user. Most real dialogue systems mix several of these capabilities.

Language generation in a dialogue system follows one of two broad approaches: retrieval-based and generative. In the retrieval-based approach we build a corpus of query-response pairs for the FAQ; when a user issues a new query, we retrieve the best-matching response for it from the corpus. This process is usually split into two stages, recall (retrieve) and ranking. Recall finds the few dozen or few hundred corpus queries most similar to the user query, drastically shrinking the set of candidate responses; this stage uses lightweight methods such as inverted indexes and Approximate Nearest Neighbor Search (ANNS) for fast lookup. Ranking then filters the recalled candidates further: we can build richer features and use machine learning or deep learning models to order them.
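To make the two-stage structure concrete, here is a toy, self-contained sketch. Everything in it (the tiny FAQ dictionary, the recall and rank functions) is purely illustrative; in the real system below, recall is done with HNSW over sentence vectors and ranking with LightGBM over rich similarity features.

# Toy retrieve-then-rank demo; not the project's actual code.
faq = {
    "手机多少钱": "这款手机售价 2999 元",
    "怎么申请退货": "请在订单页面点击申请退货",
    "今天天气怎么样": "我只是个客服机器人哦",
}

def recall(query, k=2):
    # cheap recall stage: keep the k stored questions with the largest character overlap
    return sorted(faq, key=lambda q: len(set(q) & set(query)), reverse=True)[:k]

def rank(query, candidates):
    # fine ranking stage: here just character-level Jaccard similarity
    jaccard = lambda q: len(set(q) & set(query)) / len(set(q) | set(query))
    return max(candidates, key=jaccard)

query = "这个手机卖多少钱"
print(faq[rank(query, recall(query))])  # -> 这款手机售价 2999 元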

I. Intent Recognition

We treat this as a simple binary text classification task: based on the user's opening utterance we decide whether the intent is a business request or chit-chat. If it is business, the query is passed to the retrieval model; if it is chit-chat, it is passed to the chat generation model.

1. Introduction to fastText

fastText is a word-embedding and text-classification tool open-sourced by Facebook in 2016. Academically it is not a big innovation, but its advantages are obvious: on text classification tasks, fastText (a shallow network) often reaches accuracy comparable to deep networks while training orders of magnitude faster. On a standard multi-core CPU it can train word vectors on a billion-word corpus in under ten minutes, and classify more than 500,000 sentences across over 300,000 categories in under a minute.

fastText model structure

(fastText architecture diagram omitted)
Note: this architecture diagram does not show the word-vector training process. Like CBOW, the fastText model has only three layers: an input layer, a hidden layer, and an output layer (hierarchical softmax). In both models the input is a set of words represented as vectors, the output is a specific target, and the hidden layer is the average of the input word vectors. The differences: CBOW's input is the context of a target word, whereas fastText's input is the words of a document plus their n-gram features, which together represent that document; CBOW's input words are one-hot encoded, whereas fastText's input features are embedded; CBOW's output is the target word, whereas fastText's output is the document's class label.

Notably, fastText also feeds character-level n-gram vectors of each word as extra input features.
A common text feature is the bag-of-words (converting the input into its BoW representation), but bag-of-words ignores word order, so fastText additionally uses word n-gram features.

For the sentence “我 爱 她” (“I love her”), the bag-of-words features are “我”, “爱”, “她”, which are exactly the same as those of the sentence “她 爱 我” (“she loves me”).
If we add 2-grams, the first sentence also has the features “我-爱” and “爱-她”, which distinguishes “我 爱 她” from “她 爱 我”. For efficiency, low-frequency n-grams are filtered out.

The core idea of fastText: average the word and n-gram vectors of the whole document to obtain a document vector, then run a softmax multi-class classifier on that vector.
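A minimal numpy sketch of this idea (toy vocabulary, random weights, and a plain softmax instead of the hierarchical softmax and feature hashing that real fastText uses):

# Illustrative only: fastText-style document classification in a few lines.
import numpy as np

tokens = "我 爱 她".split()
features = tokens + ["-".join(p) for p in zip(tokens, tokens[1:])]  # unigrams + 2-gram features

vocab = {f: i for i, f in enumerate(features)}  # toy vocabulary
emb = np.random.randn(len(vocab), 100)          # embedding table: input features are embedded
W = np.random.randn(100, 2)                     # output layer for 2 classes

doc_vec = emb[[vocab[f] for f in features]].mean(axis=0)  # hidden layer: average of feature vectors
logits = doc_vec @ W
probs = np.exp(logits) / np.exp(logits).sum()             # softmax over the class labels
print(features, probs)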

2. Data Format

train_no_blank.csv
(screenshot of train_no_blank.csv omitted)
ware.txt
(screenshot of ware.txt omitted)
ware.txt is used to generate key_word.txt.
We label each sentence in train_no_blank.csv by checking whether it contains any entry from key_word.txt: if it does, it is labeled __label__1, otherwise __label__0. (Why __label__1 instead of just 1? Because we use fastText for intent recognition, and fastText expects labels in the __label__1 / __label__0 format.) __label__1 means business, __label__0 means chit-chat.
The final training data format is:
(screenshot of the labeled training data omitted)
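Illustratively (assuming clean() returns a whitespace-tokenised sentence, as fastText expects), a training line is the cleaned question, a tab, and the label. The two lines below are made-up examples following the labeling rule in data_process, not rows from the real dataset:

手机 屏幕 有点 小	__label__1
今天 心情 怎么样	__label__0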

3. Code

#%%
import logging
import sys
import os

import fasttext
import jieba.posseg as pseg
import pandas as pd
from tqdm import tqdm
import config
from config import root_path
from preprocessor import clean, filter_content
# %%

# %%
tqdm.pandas()
# %%
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s",
                    datefmt="%H:%M:%S",
                    level=logging.INFO)
# %%
class Intention(object):
    def __init__(self,
                 data_path=config.train_path,  
                 sku_path=config.ware_path,  
                 model_path=None,  
                 kw_path=None,  
                 model_train_file=config.business_train,  
                 model_test_file=config.business_test): 
        self.model_path = model_path
        self.data = pd.read_csv(data_path , encoding='utf-8')

        if model_path and os.path.exists(model_path):
            self.fast = fasttext.load_model(model_path)
        else:
            self.kw = self.build_keyword(sku_path, to_file=kw_path)
            self.data_process(model_train_file)  # Create
            self.fast = self.train(model_train_file, model_test_file)

    def build_keyword(self, sku_path, to_file):
        """Extract candidate business keywords (nouns / verb-nouns / proper nouns)
        from the training questions and merge in the SKU words from ware.txt."""
        logging.info('Building keywords.')
        tokens = []
        tokens = self.data['custom'].dropna().apply(
            lambda x: [
                token for token, pos in pseg.cut(x) if pos in ['n', 'vn', 'nz']
                ])

        key_words = set(
            [tk for idx, sample in tokens.iteritems()
                for tk in sample if len(tk) > 1])
        logging.info('Key words built.')
        sku = []
        with open(sku_path, 'r' , encoding='utf-8') as f:
            next(f)
            for lines in f:
                line = lines.strip().split('\t')
                sku.extend(line[-1].split('/'))
        key_words |= set(sku)
        logging.info('Sku words merged.')
        if to_file is not None:
            with open(to_file, 'w' , encoding='utf-8') as f:
                for i in key_words:
                    f.write(i + '\n')
        return key_words

    def data_process(self, model_data_file):
        """Label each question by keyword matching and write fastText training lines."""
        logging.info('Processing data.')
        self.data['is_business'] = self.data['custom'].progress_apply(
            lambda x: 1 if any(kw in x for kw in self.kw) else 0)
        with open(model_data_file, 'w' , encoding='utf-8') as f:
            for index, row in tqdm(self.data.iterrows(),
                                    total=self.data.shape[0]):
                outline = clean(row['custom']) + "\t__label__" + str(
                    int(row['is_business'])) + "\n"
                f.write(outline)
    def train(self, model_data_file, model_test_file):
        """Train the fastText intent classifier, evaluate it, and save the model."""
        logging.info('Training classifier.')
        classifier = fasttext.train_supervised(model_data_file,
                                               label="__label__",
                                               dim=100,
                                               epoch=5,
                                               lr=0.1,
                                               wordNgrams=2,
                                               loss='softmax',
                                               thread=5,
                                               verbose=True)
        self.test(classifier, model_test_file)
        classifier.save_model(self.model_path)
        logging.info('Model saved.')
        return classifier

    def test(self, classifier, model_test_file):
        """Build the test file with the same labeling rule and report the F1 score."""
        logging.info('Testing trained model.')
        test = pd.read_csv(config.test_path).fillna('')
        test['is_business'] = test['custom'].progress_apply(
            lambda x: 1 if any(kw in x for kw in self.kw) else 0)

        with open(model_test_file, 'w' , encoding='utf-8') as f:
            for index, row in tqdm(test.iterrows(), total=test.shape[0]):
                outline = clean(row['custom']) + "\t__label__" + str(
                    int(row['is_business'])) + "\n"
                f.write(outline)
        result = classifier.test(model_test_file)
        # F1 score
        print(result[1] * result[2] * 2 / (result[2] + result[1]))

    def predict(self, text):
        """Predict the intent label (__label__1 business / __label__0 chit-chat) for a query."""
        logging.info('Predicting.')
        label, score = self.fast.predict(clean(filter_content(text)))
        return label, score

#%%
if __name__ == "__main__":
    it = Intention(config.train_path,
                 config.ware_path,
                 model_path=config.ft_path,
                 kw_path=config.keyword_path)
    print(it.predict('你最近怎么样'))
    print(it.predict('你好手机多少钱'))


Output:
(prediction output omitted)

II. Retrieval Model

We use Hierarchical Navigable Small World (HNSW), one of the most common Approximate Nearest Neighbor Search (ANNS) methods, for the recall stage; we then construct a variety of similarity features (including a deep matching network) and use LightGBM to train a Learning-to-Rank model.
Readers who want to understand how HNSW works can refer to this article: HNSW原理.

1. Preprocessing

HNSW cannot index raw text directly, so each sentence must first be converted into a vector. Here we represent a sentence as the average of its word vectors, computed as follows:

def wam(sentence, w2v_model):
    arr = []
    for s in clean(sentence).split():
        if s not in w2v_model.wv.vocab.keys():
            arr.append(np.random.randn(1, 300))  # word missing from the word2vec vocabulary: fall back to a random 300-dim vector
        else:
            arr.append(w2v_model.wv.get_vector(s))  # look up the word vector from the model
    return np.mean(np.array(arr), axis=0).reshape(1, -1)
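
A quick usage check (assuming config.w2v_path points to a 300-dimensional gensim model saved with .save(), loaded the same way the HNSW class below loads it):

from gensim.models import KeyedVectors
import config

w2v_model = KeyedVectors.load(config.w2v_path)  # same loading call as in HNSW.__init__
print(wam('在手机上下载', w2v_model).shape)      # expected: (1, 300)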
    

2. Building the HNSW Graph

class HNSW(object):
    def __init__(self,
                 w2v_path,
                 data_path=None,
                 ef=config.ef_construction,
                 M=config.M,
                 model_path=config.hnsw_path):
        self.w2v_model = KeyedVectors.load(w2v_path)

        self.data = self.data_load(data_path)
        if model_path and os.path.exists(model_path):
            # load an existing index from disk
            self.hnsw = self.load_hnsw(model_path)
        else:
            # build a new index from the training data
            self.hnsw = \
                self.build_hnsw(os.path.join(config.root_path, 'model/retrieval/hnsw.bin'),
                                ef=ef,
                                m=M)

    def data_load(self, data_path):
        """Read the Q&A data and attach a 300-dim sentence vector to every question."""
        data = pd.read_csv(
            data_path)
        data['custom_vec'] = data['custom'].apply(
            lambda x: wam(x, self.w2v_model))
        data['custom_vec'] = data['custom_vec'].apply(
            lambda x: x[0][0] if x.shape[1] != 300 else x)
        data = data.dropna()
        return data

    def build_hnsw(self, to_file, ef=2000, m=64):
        """Build the hnswlib index over all sentence vectors and save it to disk."""
        logging.info('build_hnsw')
        dim = self.w2v_model.vector_size
        num_elements = self.data['custom'].shape[0]
        hnsw = np.stack(self.data['custom_vec'].values).reshape(-1, 300)

       
        p = hnswlib.Index(space='l2',
                          dim=dim)  # possible options are l2, cosine or ip
        p.init_index(max_elements=num_elements, ef_construction=ef, M=m)
        p.set_ef(10)
        p.set_num_threads(8)
        p.add_items(hnsw)
        logging.info('Start')
        labels, distances = p.knn_query(hnsw, k=1)
        print('labels: ', labels)
        print('distances: ', distances)
        logging.info("Recall:{}".format(
            np.mean(labels.reshape(-1) == np.arange(len(hnsw)))))
        p.save_index(to_file)
        return p

    def load_hnsw(self, model_path):
        """Load a previously saved hnswlib index."""
        hnsw = hnswlib.Index(space='l2', dim=self.w2v_model.vector_size)
        hnsw.load_index(model_path)
        return hnsw

    def search(self, text, k=5):
        """Embed the query and return the k nearest questions with their answers and distances."""
        test_vec = wam(clean(text), self.w2v_model)
        q_labels, q_distances = self.hnsw.knn_query(test_vec, k=k)
        return pd.concat(
            (self.data.iloc[q_labels[0]]['custom'].reset_index(),
             self.data.iloc[q_labels[0]]['assistance'].reset_index(drop=True),
             pd.DataFrame(q_distances.reshape(-1, 1), columns=['q_distance'])),
            axis=1)
if __name__ == "__main__":
    hnsw = HNSW(config.w2v_path,
                config.train_path,
                config.ef_construction,
                config.M,
                config.hnsw_path
                )

    test = '在手机上下载'
    result = hnsw.search(test, k=10)

(search results for the test query omitted)

Important parameter: space
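According to the hnswlib documentation, space selects the distance used by the index: 'l2' is squared L2 distance, 'ip' is inner product (1 - dot product), and 'cosine' is 1 - cosine similarity. The index built above uses 'l2'.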

At this point the recall model is complete. Next, its output is taken as input to a learning-to-rank stage, which produces the final result.

3. Learning to Rank

In this step we build multiple similarity features, which fall into a few groups (several of them are implemented by the helper functions below):

  1. String-distance based (edit/Levenshtein distance, LCS);
  2. Vector-distance based (cosine, Euclidean, Jaccard, WMD);
  3. Statistics based (BM25, Pearson correlation);
  4. Deep matching model based.

After constructing the features, we use LightGBM to train a Learning-to-Rank model.
#LCS
def lcs(self , str_a , str_b):
    lengths = [[0 for j in range(len(str_b) + 1 )]
                for i in range(len(str_a) + 1)]
    for i,x in enumerate(str_a):
        for j,y in enumerate(str_b):
            if x==y:
                lengths[i+1][j+1] = lengths[i][j] + 1
            else:
                lengths[i+1][j+1] = max(lengths[i+1][j] , lengths[i][j+1])
    
    result = ""
    x,y = len(str_a) , len(str_b)
    while x !=0 and y !=0:
        if lengths[x][y] == lengths[x - 1][y]:
            x -= 1
        elif lengths[x][y] == lengths[x][y-1]:
            y -= 1
        else:
            assert str_a[x-1] == str_b[y-1]
            result = str_a[x-1] + result
            x -= 1
            y -= 1
    
    longestdist = lengths[len(str_a)][len(str_b)]
    ratio = longestdist / min(len(str_a) , len(str_b))
    return ratio

def editDistance(self , str1 , str2):
       m = len(str1)
       n = len(str2)
       lensum = float(m + n)
       d = [[0] * (n+1) for _ in range(m+1)]
       for i in range(m+1):
           d[i][0] = i
       for j in range(n+1):
           d[0][j] = j
       
       for j in range(1 , n+1):
           for i in range(1 , m+1):
               if str1[i -1] == str2[j -1]:
                   d[i][j] = d[i-1][j-1]
               else:
                   d[i][j] = min(d[i-1][j] , d[i][j-1] , d[i-1][j-1]) + 1
       dist = d[-1][-1]
       ratio = (lensum -dist) / lensum
       return ratio

def JaccardSim(self , str_a , str_b):
    seta = self.tokenize(str_a)[1]
    setb = self.tokenize(str_b)[1]
    sa_sb = 1.0 * len(seta & setb) / len(seta | setb)
    return sa_sb

def cos_sim(a ,b):
    a = np.array(a)
    b = np.array(b)
    return np.sum(a * b) / (np.sqrt(np.sum(a**2)) * np.sqrt(np.sum(b**2)))


def eucl_sim(a ,b):
    # Euclidean-distance based similarity
    a = np.array(a)
    b = np.array(b)
    return 1 / (1 + np.sqrt(np.sum((a - b) ** 2)))

def pearson_sim(a , b):
    a = np.array(a)
    b = np.array(b)

    a = a - np.average(a)
    b = b - np.average(b)
    return np.sum(a * b) / (np.sqrt(np.sum(a**2)) * np.sqrt(np.sum(b**2)))
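
WMD appears in the vector-distance group above but is not implemented by these helpers; gensim exposes it directly on the word-vector model (older gensim versions need the pyemd package), so a sketch would be:

# Assumes w2v_model is the loaded gensim model and both sentences are pre-tokenised.
wmd = w2v_model.wv.wmdistance('手机 多少 钱'.split(), '这个 手机 的 价格'.split())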


(1) BM25

A brief introduction to BM25:

(BM25 illustration omitted)
BM25 is an algorithm for scoring the relevance between a search query and a document, derived from a probabilistic retrieval model. In plain terms: given a query and a set of documents Ds, we want the relevance score between the query and each document D. We first segment the query into words q_i; each word's score then has three components:

  1. the relevance between the word q_i and the document D;
  2. the relevance between the word q_i and the query itself;
  3. the weight of the word.

Summing the scores of all the words gives the score between the query and the document.
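For reference, a common full form of BM25, including the query-term-frequency factor that the bm_25 method below also uses, is (in LaTeX):

\mathrm{score}(Q, D) = \sum_{q_i \in Q} \mathrm{IDF}(q_i)\cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)} \cdot \frac{qf(q_i)\,(k_2 + 1)}{qf(q_i) + k_2}

where f(q_i, D) is the frequency of q_i in document D, qf(q_i) its frequency in the query, |D| the document length, avgdl the average document length, and k_1, b, k_2 the usual hyperparameters (1.2, 0.75 and 200 in the code below).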
#%%
import math
import sys
from collections import Counter
import os
import csv

# %%
import jieba
import jieba.posseg as pseg
import numpy as np
import pandas as pd
import joblib
from config import root_path
# %%
class BM25(object):
    def __init__(self, do_train=True , save_path=os.path.join(root_path, 'model/ranking/')):
        if do_train:
            self.data = pd.read_csv(os.path.join(root_path , 'data/ranking/train.tsv'), sep='\t', header=None,
                                    quoting=csv.QUOTE_NONE, names=['question1', 'question2', 'target'])
            self.idf, self.avgdl = self.get_idf()
            self.saver(save_path)
        else:
            self.stopwords = self.load_stop_word()
            self.load(save_path)
    

    def load_stop_word(self):
        stop_words = os.path.join(root_path, 'data/stopwords.txt')
        stopwords = open(stop_words , 'r' , encoding='utf-8').readlines()
        stopwords = [w.strip() for w in stopwords]  # strip whitespace/newlines from each stop word read from the file
        return stopwords
    
    def tf(self , word, count):
        return count[word] / sum(count.values())

    def n_containing(self , word , count_list):
        return sum(1 for count in count_list if word in count)

    def cal_idf(self , word , count_list):
        # smoothed idf: log(N / (1 + document frequency))
        return math.log(len(count_list) / (1 + self.n_containing(word , count_list)))

    def get_idf(self):
        self.data['question2'] = self.data['question2'].apply(lambda x: " ".join(jieba.cut(x)))
        idf = Counter([y for x in self.data['question2'].tolist() for y in x.split()])
        idf = {k: self.cal_idf(k, self.data['question2'].tolist()) for k, v in idf.items()}
        avgdl = np.array([len(x.split()) for x in self.data['question2'].tolist()]).mean()
        return idf, avgdl
    
    def saver(self , save_path):
        joblib.dump(self.idf , save_path + 'bm25_idf.bin')
        joblib.dump(self.avgdl , save_path + 'bm25_avgdl.bin')
    
    def load(self , save_path):
        self.idf = joblib.load(save_path + 'bm25_idf.bin')
        self.avgdl = joblib.load(save_path + 'bm25_avgdl.bin')

    def bm_25(self , q , d , k1=1.2 , k2=200 , b=0.75):
        stop_flag = ['x', 'c', 'u', 'd', 'p', 't', 'uj', 'm', 'f', 'r']
        words = pseg.cut(q)  # POS-tag and segment the query
        fi = {}
        qfi = {}
        for word, flag in words:
            if flag not in stop_flag and word not in self.stopwords:
                fi[word] = d.count(word)
                qfi[word] = q.count(word)
        K = k1 * (1 - b + b * (len(d) / self.avgdl))  # document-length normalisation factor K
        ri = {}
        for key in fi:
            ri[key] = fi[key] * (k1+1) * qfi[key] * (k2+1) / ((fi[key] + K) * (qfi[key] + k2))  # per-term score R_i

        score = 0
        for key in ri:
            score += self.idf.get(key, 20.0) * ri[key]
        return score
#%%
if __name__ == '__main__':
    bm25 = BM25(do_train = True)
    
# %%
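A hedged usage sketch: after the training cell above has been run once (so that bm25_idf.bin and bm25_avgdl.bin exist under model/ranking/), scoring a query against a candidate document looks like:

bm25 = BM25(do_train=False)
print(bm25.bm_25('手机多少钱', '这个手机的价格是2999元'))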

(2) Deep Matching

This is essentially binary classification: similar pairs are labeled 1, dissimilar pairs 0.
We use BERT directly for it.
The data format looks like:
(screenshot of the sentence-pair training data omitted)

Model (with the imports this snippet needs added; the origin of is_cuda is not shown in the post, so it is defined here as a CUDA-availability check as an assumption):

import os
import torch
import torch.nn as nn
from transformers import BertForSequenceClassification
from config import root_path

is_cuda = torch.cuda.is_available()  # assumption: the original project defines is_cuda elsewhere

class BertModelTrain(nn.Module):
    def __init__(self):
        super(BertModelTrain, self).__init__()
        self.bert = BertForSequenceClassification.from_pretrained(
            os.path.join(root_path, 'lib/bert/'), num_labels=2)
        self.device = torch.device("cuda") if is_cuda else torch.device("cpu")
        for param in self.bert.parameters():
            param.requires_grad = True  

    def forward(self, batch_seqs, batch_seq_masks, batch_seq_segments, labels):
        outputs        = self.bert(input_ids=batch_seqs,
                                 attention_mask=batch_seq_masks,
                                 token_type_ids=batch_seq_segments,
                                 labels=labels)
        loss = outputs[0]
        logits = outputs[1]
        probabilities = nn.functional.softmax(logits, dim=-1)
        return loss, logits, probabilities

The output is the similarity score between the two sentences.
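The MatchingNN wrapper imported by the ranking code below (from matchnn) is not shown in this post. Here is a minimal sketch of what a pair-scoring predict helper could look like, assuming a trained BertModelTrain instance and a matching BertTokenizer; the names are illustrative, not the project's actual API. Note that the ranking code only uses index [1] of the returned tuple, i.e. the probability of the "similar" class.

# Illustrative pair classifier: returns (predicted label, P(similar)).
import torch
from transformers import BertTokenizer  # tokenizer assumed to come from the same lib/bert/ directory

def predict_similarity(model, tokenizer, q1, q2, device='cpu', max_len=128):
    model.eval()
    enc = tokenizer(q1, q2, truncation=True, max_length=max_len,
                    padding='max_length', return_tensors='pt')
    with torch.no_grad():
        # BertModelTrain.forward expects labels, so pass a dummy label of 0
        _, _, probs = model(enc['input_ids'].to(device),
                            enc['attention_mask'].to(device),
                            enc['token_type_ids'].to(device),
                            torch.tensor([0]).to(device))
    return int(probs.argmax(dim=-1)), float(probs[0, 1])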

We then feed the hand-crafted similarity features above into LightGBM for training, and output the fine-ranked result.

#%%
import sys
import os
import csv
import logging
import lightgbm as lgb
import pandas as pd
import joblib
from tqdm import tqdm
from config import root_path
from matchnn import MatchingNN
from similarity import TextSimilarity
from hnsw_faiss import wam
from sklearn.model_selection import train_test_split
import numpy as np
# %%
tqdm.pandas()
# %%
params = {'boosting_type': 'gbdt',
          'max_depth': 5,
          'objective': 'binary',
          'nthread': 3,  
          'num_leaves': 64,
          'learning_rate': 0.05,
          'max_bin': 512,
          'subsample_for_bin': 200,
          'subsample': 0.5,
          'subsample_freq': 5,
          'colsample_bytree': 0.8,
          'reg_alpha': 5,
          'reg_lambda': 10,
          'min_split_gain': 0.5,
          'min_child_weight': 1,
          'min_child_samples': 5,
          'scale_pos_weight': 1,
          'max_position': 20,
          'group': 'name:groupId',
          'metric': 'auc'}
# %%
class RANK(object):
    def __init__(self , do_train = True, model_path= os.path.join(root_path, 'model/ranking/lightgbm')):
        self.ts = TextSimilarity()
        self.matchingNN = MatchingNN()
        if do_train:
            logging.info('Training mode')
            self.train = pd.read_csv(
                os.path.join(root_path, 'data/ranking/train.tsv'),
                delimiter="\t", 
                encoding="utf-8"
            )

            self.data = self.generate_feature(self.train)
            self.columns = [i for i in self.train.columns if 'question' not in i]
            self.trainer()
            self.save(model_path)
        else:
            logging.info('Predicting mode')
            self.test = pd.read_csv(
                os.path.join(root_path, 'data/ranking/test.tsv'),
                delimiter="\t", 
                encoding="utf-8"
                )
            
#            self.testdata = self.generate_feature(self.test)
            self.gbm = joblib.load(model_path)
#            self.predict(self.testdata)

    def generate_feature(self, data):
        logging.info('Generating manual features.')
        data = pd.concat([data, pd.DataFrame.from_records(data.apply(lambda row: self.ts.generate_all(row['question1'] , row['question2']), axis=1))], axis=1)
        logging.info('Generating deep-matching features.')
        data['matching_score'] = data.apply(lambda row: self.matchingNN.predict(row['question1'] , row['question2'])[1] , axis=1)
        return data


    def trainer(self):
        logging.info('Training lightgbm model.')
        self.gbm = lgb.LGBMRanker(**params)
        columns = [i for i in self.data.columns if i not in ['question1', 'question2' , 'target']]
        X_train , X_test , y_train , y_test = train_test_split(self.data[columns] , self.data['target'] , test_size = 0.3 , random_state = 42)
        query_train = [X_train.shape[0]]
        query_val = [X_test.shape[0]]
        self.gbm.fit(X_train , y_train , group=query_train , eval_set=[(X_test , y_test)] , eval_group=[query_val] , eval_at=[5 , 10 , 20] , early_stopping_rounds=50)
    
    def save(self, model_path):
        logging.info('Saving lightgbm model.')
        joblib.dump(self.gbm, model_path)

    def predict(self , data: pd.DataFrame):
        columns = [i for i in data.columns if i not in ['question1' , 'question2' , 'target']]
        result = self.gbm.predict(data[columns])
        return result



if __name__ == '__main__':
    rank = RANK(do_train=False)
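
Finally, the script below (run on the server, as noted further down) chains the whole pipeline together for a single query: fastText intent recognition, HNSW recall, manual and deep-matching feature generation, and LightGBM ranking.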


# -*- coding: utf-8 -*-
"""
Created on Tue Jan 26 10:03:50 2021

@author: Sean
"""
#%%
import os
from business import Intention
from hnsw_faiss import HNSW
from ranker import RANK
import config
import pandas as pd
#%%
it = Intention(config.train_path,
                config.ware_path,
                model_path = config.ft_path,
                kw_path= config.keyword_path
                )
hnsw = HNSW(config.w2v_path,
            config.train_path,
            config.ef_construction,
            config.M,
            config.hnsw_path
            )

#%%
import joblib
import ranker
model_path= os.path.join(config.root_path, 'model/ranking/lightgbm')
gbm = joblib.load(model_path)
#%%
query = '请问这电脑厚度是多少' 
label,score = it.predict(query)
res = pd.DataFrame()
if len(query) > 1 and '__label__1' in label:
   res = res.append(pd.DataFrame({'query': [query]*5 ,'retrieved': hnsw.search(query, 5)['custom'] , 'retr_assistance': hnsw.search(query, 5)['assistance']}))
#%%
ranked = pd.DataFrame()
#%%
ranked['question1'] = res['query']
ranked['question2'] = res['retrieved']
ranked['answer'] = res['retr_assistance']
#%%
from similarity import TextSimilarity
ts = TextSimilarity()
data = ranked
data = pd.concat([data, pd.DataFrame.from_records(data.apply(lambda row: ts.generate_all(row['question1'] , row['question2']), axis=1))], axis=1)
#%%
from matchnn import MatchingNN
matchingNN = MatchingNN()
data['matching_score'] = data.apply(lambda row: matchingNN.predict(row['question1'] , row['question2'])[1] , axis=1)
data.to_csv('result/qa_result.csv', index=False)
#%%
'''
The code above is run on the server; retrieve qa_result.csv from it.
'''
#%%
'''
Fine-ranking result.
It combines multiple similarity measures:
lcs, edit_dist, jaccard, bm25, w2v_cos, w2v_eucl, w2v_pearson, w2v_wmd, fast_cos, fast_eucl, fast_pearson, fast_wmd, tfidf_cos, tfidf_eucl, tfidf_pearson
'''
import pandas as pd
import ranker


qa_result = pd.read_csv('result/qa_result (3).csv')
columns = [i for i in qa_result.columns if i not in ['question1' , 'question2' , 'target', 'answer']]
rank_scores = gbm.predict(qa_result[columns])
qa_result['rank_score'] = rank_scores
qa_result.to_csv('result/result.csv', index=False)

#%%

result = qa_result['rank_score'].sort_values(ascending=False)
#%%
print(qa_result['answer'].iloc[result.index[0]])

III. Summary

At the moment only the business Q&A part is finished; the chit-chat part has not been fully completed yet.
All of the code above has been uploaded to GitHub:
FAQ-question-answer-system
