中文短语相似度方案&实验

最新推荐文章于 2024-09-23 15:48:26 发布

Yonggie

最新推荐文章于 2024-09-23 15:48:26 发布

阅读量957

点赞数

分类专栏：小功能实现（杂）文章标签： NLP 自然语言处理

本文链接：https://blog.csdn.net/Yonggie/article/details/127261289

版权

小功能实现（杂）专栏收录该内容

29 篇文章

订阅专栏

背景

最近要做一个文本到graph的数据集，里面有不少不同字但同义的小文本段，想简单用文本相似度的方式把他们归到一类里面。

方式

直接调用hanlp的sentence sim，帮忙解决
直接用hugging face的transformer的高级api，也是直接端到端帮忙解决。
用中文transformer做sentence embedding，后用距离函数设定阈值解决。

Hanlp

根据这里，pip install hanlp后，直接

import hanlp
sts = hanlp.load(hanlp.pretrained.sts.STS_ELECTRA_BASE_ZH)

sentence_list=make_sentence_list()
# sentence_list=[
#     ('看图猜一电影名', '看图猜电影'),
#     ('无线路由器怎么无线上网', '无线上网卡和无线路由器怎么用'),
#     ('北京到上海的动车票', '上海到北京的动车票'),
# ]
res=sts(sentence_list)

print(res)

transformer end2end

pip install transformer后，
本来想偷懒直接用pipeline的，

from transformers import AutoModelForTokenClassification,AutoTokenizer,pipeline
model_id='uer/sbert-base-chinese-nli'
scorer=pipeline("sentence-similarity",model=model_id)

但是报错，嘻嘻：
"Unknown task sentence-similarity, available tasks are ['audio-classification', 'automatic-speech-recognition', 'conversational', 'document-question-answering', 'feature-extraction', 'fill-mask', 'image-classification', 'image-segmentation', 'image-to-text', 'ner', 'object-detection', 'question-answering', 'sentiment-analysis', 'summarization', 'table-question-answering', 'text-classification', 'text-generation', 'text2text-generation', 'token-classification', 'translation', 'visual-question-answering', 'vqa', 'zero-shot-classification', 'zero-shot-image-classification', 'translation_XX_to_YY']"
行吧，再偷懒下，看看能不能autotokenizer和automodel解决

根据我的这篇教程，找到对应任务后选择model id。
找到后，只有两个模型……好家伙，够少的。怎么连个example也没有……
在这里插入图片描述
随便选择一个，然后看下其他语言的sentence similarity的example：

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Load the model
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-small")
model = AutoModel.from_pretrained("johngiorgi/declutr-small")

# Prepare some text to embed
text = [
    "A smiling costumed woman is holding an umbrella.",
    "A happy woman in a fairy costume holds an umbrella.",
]
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")

# Embed the text
with torch.no_grad():
    sequence_output = model(**inputs)[0]

# Mean pool the token-level embeddings to get sentence-level embeddings
embeddings = torch.sum(
    sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdims=True), min=1e-9)

# Compute a semantic similarity via the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])