(4) Text-Based QA System: The Bi-Encoder Method

Simonsdu

Last modified 2022-04-09 16:02:59


Category: Text-Based QA System. Tags: python, nlp, natural language processing

First published 2022-04-09 15:15:06

Article link: https://blog.csdn.net/Simonsdu/article/details/124061352


The bi-encoder method


Loading the pretrained model

We use the pretrained model Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit", cache_dir='./SGPT-125M-weightedmean-msmarco-specb-bitfit')
model = AutoModel.from_pretrained("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit", cache_dir='./SGPT-125M-weightedmean-msmarco-specb-bitfit')

Getting the token IDs of the start and end markers for queries and documents

SPECB_QUE_BOS = tokenizer.encode("[", add_special_tokens=False)[0]
SPECB_QUE_EOS = tokenizer.encode("]", add_special_tokens=False)[0]

SPECB_DOC_BOS = tokenizer.encode("{", add_special_tokens=False)[0]
SPECB_DOC_EOS = tokenizer.encode("}", add_special_tokens=False)[0]

Tokenizing and encoding

def tokenize_with_specb(texts, is_query):
    # Tokenize without padding
    batch_tokens = tokenizer(texts, padding=False, truncation=True)
    # Add special brackets & pay attention to them
    for seq, att in zip(batch_tokens["input_ids"], batch_tokens["attention_mask"]):
        if is_query:
            seq.insert(0, SPECB_QUE_BOS)
            seq.append(SPECB_QUE_EOS)
        else:
            seq.insert(0, SPECB_DOC_BOS)
            seq.append(SPECB_DOC_EOS)
        att.insert(0, 1)
        att.append(1)
    # Add padding
    batch_tokens = tokenizer.pad(batch_tokens, padding=True, return_tensors="pt")
    return batch_tokens
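
To make the bracket-and-pad step easier to follow, here is a minimal sketch that runs without the tokenizer or the model. The token IDs (901 to 904), the padding ID 0, and the helper name `wrap_and_pad` are all made up for illustration, and right-side padding is an assumption; the real values come from the tokenizer shown above.

```python
# Hypothetical token IDs, chosen only for illustration; the real values
# come from tokenizer.encode(...) as shown earlier.
QUE_BOS, QUE_EOS = 901, 902
DOC_BOS, DOC_EOS = 903, 904
PAD_ID = 0  # assumed padding ID


def wrap_and_pad(batch_ids, is_query):
    """Wrap each sequence in bracket IDs, then right-pad to a common length."""
    bos, eos = (QUE_BOS, QUE_EOS) if is_query else (DOC_BOS, DOC_EOS)
    wrapped = [[bos] + list(seq) + [eos] for seq in batch_ids]
    max_len = max(len(seq) for seq in wrapped)
    input_ids = [seq + [PAD_ID] * (max_len - len(seq)) for seq in wrapped]
    # 1 marks real tokens (including the brackets), 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in wrapped]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = wrap_and_pad([[11, 12, 13], [21]], is_query=True)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence ends up with two padding positions whose mask entries are 0, so the pooling step later ignores them, while the brackets themselves keep a mask of 1.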

Computing the sentence embeddings

def get_weightedmean_embedding(batch_tokens, model):
    # Get the embeddings
    with torch.no_grad():
        # Get hidden state of shape [bs, seq_len, hid_dim]
        last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state

    # Get weights of shape [bs, seq_len, hid_dim]
    weights = (
        torch.arange(start=1, end=last_hidden_state.shape[1] + 1)
        .unsqueeze(0)
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float().to(last_hidden_state.device)
    )

    # Get attn mask of shape [bs, seq_len, hid_dim]
    input_mask_expanded = (
        batch_tokens["attention_mask"]
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
    )

    # Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim
    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)
    sum_mask = torch.sum(input_mask_expanded * weights, dim=1)

    embeddings = sum_embeddings / sum_mask

    return embeddings
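
The pooling above weights the token at position i (counting from 1) by i times its attention-mask value, then divides by the sum of those weights. As a rough, dependency-free sketch of the same arithmetic on a single sequence, with plain lists standing in for tensors (the function name `weighted_mean_pool` and the toy numbers are illustrative only):

```python
def weighted_mean_pool(hidden_states, attention_mask):
    """Position-weighted mean over one sequence.

    hidden_states: list of seq_len vectors (each a list of floats).
    attention_mask: list of seq_len ints, 1 for real tokens and 0 for padding.
    """
    hid_dim = len(hidden_states[0])
    weighted_sum = [0.0] * hid_dim
    weight_total = 0.0
    for position, (vector, mask) in enumerate(zip(hidden_states, attention_mask), start=1):
        weight = position * mask  # later tokens weigh more; padding weighs 0
        weight_total += weight
        for d in range(hid_dim):
            weighted_sum[d] += weight * vector[d]
    return [value / weight_total for value in weighted_sum]


# Three real tokens followed by one padding position.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
pooled = weighted_mean_pool(states, mask)
print(pooled)
```

Here the weights are 1, 2, 3 and 0, so the result is (4/6, 5/6) and the padding vector contributes nothing.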

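Computing similarity

from scipy.spatial.distance import cosine

# `queries` and `docs` are lists of strings supplied by the caller;
# this snippet assumes at least one query and three documents.
query_embeddings = get_weightedmean_embedding(tokenize_with_specb(queries, is_query=True), model)
doc_embeddings = get_weightedmean_embedding(tokenize_with_specb(docs, is_query=False), model)

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
# scipy's cosine() returns a distance, so similarity = 1 - distance
cosine_sim_0_1 = 1 - cosine(query_embeddings[0], doc_embeddings[0])
cosine_sim_0_2 = 1 - cosine(query_embeddings[0], doc_embeddings[1])
cosine_sim_0_3 = 1 - cosine(query_embeddings[0], doc_embeddings[2])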
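
With more than a handful of documents, it is usually more convenient to score every document in a loop and sort the results. The following is only a sketch of that idea: it uses a hand-written cosine similarity and toy 2-D vectors in place of real embeddings, and the helper names `cosine_similarity` and `rank_documents` are not part of the original code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 2-D vectors standing in for real embeddings.
query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
ranking = rank_documents(query_vec, doc_vecs)
print(ranking)
```

The third toy document points in the same direction as the query, so it ranks first even though its length differs; cosine similarity ignores vector magnitude.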

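Summary

Tokenize, embed, then compute vector similarity: this is a very simple approach, and it performs well on most simple questions. However, because the query and the document are encoded independently of each other, what the model mainly captures is the semantic similarity between the sentences rather than the logical relationship between them.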
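
One practical consequence of encoding queries and documents independently is that document vectors can be computed once and reused for every query. The sketch below illustrates only that structure. `toy_embed` is a hypothetical stand-in that counts words from a tiny made-up vocabulary; in the real pipeline it would be replaced by the tokenize-and-pool functions above. The `DocumentIndex` class is likewise invented for this example.

```python
import math

VOCAB = ["planet", "earth", "music"]  # toy vocabulary, illustration only


def toy_embed(text):
    """Hypothetical stand-in for the real encoder: counts vocabulary words."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine_similarity(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class DocumentIndex:
    def __init__(self, docs, embed=toy_embed):
        self.docs = list(docs)
        self.embed = embed
        # Encoded once, up front; no query is needed at this point.
        self.doc_vectors = [embed(doc) for doc in self.docs]

    def search(self, query):
        query_vector = self.embed(query)  # the only encoding done per query
        scores = [cosine_similarity(query_vector, v) for v in self.doc_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.docs[best], scores[best]


index = DocumentIndex(["a planet near earth", "loud music tonight"])
best_doc, best_score = index.search("which planet is closest to earth")
print(best_doc, best_score)
```

The same independence is also the source of the limitation noted above: the document vector is fixed before the query is known, so the encoder cannot attend to the query while reading the document.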
