KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.
1. Installation
Installation can be done with pypi:
pip install keybert
2. Usage
from keybert import KeyBERT
doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
Use keyphrase_ngram_range to set the length of the resulting keywords/keyphrases:
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
('algorithm', 0.4556),
('training', 0.4487),
('class', 0.4086),
('mapping', 0.3700)]
To extract keyphrases, simply set keyphrase_ngram_range to (1, 2) or higher, depending on the number of words you want in the resulting keyphrases:
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
('machine learning', 0.6305),
('supervised learning', 0.5985),
('algorithm analyzes', 0.5860),
('learning function', 0.5850)]
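As a rough illustration of what keyphrase_ngram_range controls, candidate n-grams for a range of (1, 2) can be generated like this (a simplified pure-Python stand-in; KeyBERT itself builds candidates with scikit-learn's CountVectorizer, which also handles tokenization and stop words):

```python
def candidate_ngrams(text, ngram_range):
    """Generate word n-grams for every n in the (min_n, max_n) range."""
    words = text.lower().split()
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

# With (1, 2), both single words and two-word phrases become candidates:
# candidate_ngrams("supervised learning task", (1, 2))
# -> ['supervised', 'learning', 'task', 'supervised learning', 'learning task']
```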
Highlight the keywords in the document by simply setting highlight=True:
keywords = kw_model.extract_keywords(doc, highlight=True)
3. Max Sum Distance
To diversify the results, we take the 2 × top_n words/phrases most similar to the document. Then, from those 2 × top_n words, we take all top_n combinations and extract the combination whose members are least similar to each other by cosine similarity.
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_maxsum=True, nr_candidates=20, top_n=5)
[('set training examples', 0.7504),
('generalize training data', 0.7727),
('requires learning algorithm', 0.5050),
('supervised learning algorithm', 0.3779),
('learning machine learning', 0.2891)]
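The selection step described above can be sketched as a brute-force search (illustrative only, not KeyBERT's internal code; cosine similarities are computed with NumPy):

```python
from itertools import combinations

import numpy as np


def max_sum_similarity(doc_emb, cand_embs, candidates, top_n, nr_candidates):
    """Pick top_n candidates that are similar to the document but
    mutually dissimilar (the Max Sum Distance idea)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Similarity of each candidate to the document.
    doc_sims = [cos(doc_emb, c) for c in cand_embs]
    # Keep only the nr_candidates most document-similar candidates.
    top_idx = sorted(range(len(candidates)), key=lambda i: doc_sims[i],
                     reverse=True)[:nr_candidates]
    # Among those, find the top_n combination with the lowest
    # sum of pairwise similarities.
    best_combo, best_score = None, float("inf")
    for combo in combinations(top_idx, top_n):
        score = sum(cos(cand_embs[i], cand_embs[j])
                    for i, j in combinations(combo, 2))
        if score < best_score:
            best_combo, best_score = combo, score
    return [candidates[i] for i in best_combo]
```

Note that the number of combinations grows quickly with nr_candidates, which is why KeyBERT exposes it as a tunable parameter.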
4. Maximal Marginal Relevance
To diversify the results, Maximal Marginal Relevance (MMR) can be used to create keywords/keyphrases, which is also based on cosine similarity. Results with high diversity:
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
('labels unseen instances', 0.1649),
('new examples optimal', 0.4185),
('determine class labels', 0.4774),
('supervised learning algorithm', 0.7502)]
Results with low diversity:
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.2)
[('algorithm generalize training', 0.7727),
('supervised learning algorithm', 0.7502),
('learning machine learning', 0.7577),
('learning algorithm analyzes', 0.7587),
('learning algorithm generalize', 0.7514)]
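The MMR trade-off can be sketched like this (illustrative only; the diversity argument plays the same role as in extract_keywords above, trading document relevance against redundancy among already-selected candidates):

```python
import numpy as np


def mmr(doc_emb, cand_embs, candidates, top_n, diversity):
    """Greedy Maximal Marginal Relevance selection over candidate embeddings."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    doc_sims = [cos(doc_emb, c) for c in cand_embs]
    # Start with the candidate most similar to the document.
    selected = [int(np.argmax(doc_sims))]
    remaining = [i for i in range(len(candidates)) if i not in selected]
    while len(selected) < top_n and remaining:
        # Balance relevance to the document against similarity
        # to what has already been picked.
        scores = {
            i: (1 - diversity) * doc_sims[i]
               - diversity * max(cos(cand_embs[i], cand_embs[j]) for j in selected)
            for i in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]
```

With diversity near 0 the selection is driven almost entirely by document similarity; raising it pushes the picks apart, matching the two outputs shown above.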
5. Embedding Models
KeyBERT supports many embedding models that can be used to embed documents and words:
- Sentence-Transformers
- Flair
- Spacy
- Gensim
- USE
Sentence-transformers models can be browsed on the Hugging Face hub and passed to KeyBERT via the model argument:
from keybert import KeyBERT
kw_model = KeyBERT(model='all-MiniLM-L6-v2')
Or select a SentenceTransformer model with your own parameters:
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
Source: https://github.com/MaartenGr/KeyBERT