How to Use KeyBERT

KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

1. Installation

Installation can be done via PyPI:

pip install keybert

2. Usage

from keybert import KeyBERT

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).
      """
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
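Under the hood, extract_keywords embeds the document and each candidate word with BERT and ranks candidates by cosine similarity to the document embedding. A minimal sketch of that ranking step, using made-up toy vectors in place of real BERT embeddings:

```python
import numpy as np

# Toy vectors standing in for BERT embeddings of the document and candidates.
doc_embedding = np.array([0.9, 0.1, 0.3])
candidates = {
    "learning":  np.array([0.8, 0.2, 0.4]),
    "banana":    np.array([0.1, 0.9, 0.0]),
    "algorithm": np.array([0.7, 0.1, 0.5]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank candidates by similarity to the document, highest first,
# mirroring the (keyword, score) pairs that extract_keywords returns.
ranked = sorted(
    ((word, cosine(vec, doc_embedding)) for word, vec in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
```

With real embeddings, "learning" and "algorithm" would score high against a machine-learning document while an off-topic word like "banana" would score low.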

Set the length of the resulting keywords/keyphrases with keyphrase_ngram_range:

>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
 ('algorithm', 0.4556),
 ('training', 0.4487),
 ('class', 0.4086),
 ('mapping', 0.3700)]

To extract keyphrases, simply set keyphrase_ngram_range to (1, 2) or higher, depending on the number of words you want in the resulting keyphrases:

>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
 ('machine learning', 0.6305),
 ('supervised learning', 0.5985),
 ('algorithm analyzes', 0.5860),
 ('learning function', 0.5850)]

Highlight the keywords in the document by simply setting highlight=True:

keywords = kw_model.extract_keywords(doc, highlight=True)

3. Max Sum Distance

To diversify the results, we take the 2 × top_n words/phrases most similar to the document. Then we consider all top_n combinations drawn from those 2 × top_n words and pick the combination whose members are least similar to each other by cosine similarity.

>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_maxsum=True, nr_candidates=20, top_n=5)
[('set training examples', 0.7504),
 ('generalize training data', 0.7727),
 ('requires learning algorithm', 0.5050),
 ('supervised learning algorithm', 0.3779),
 ('learning machine learning', 0.2891)]
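The selection step described above can be sketched in plain NumPy. This is an illustrative re-implementation, not KeyBERT's internal code, and the embeddings are toy vectors:

```python
import itertools
import numpy as np

def max_sum_distance(doc_emb, cand_embs, candidates, top_n, nr_candidates):
    """Pick the top_n combination, out of the nr_candidates words most similar
    to the document, whose members are least similar to one another."""
    unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    doc_sim = unit(cand_embs) @ unit(doc_emb)        # relevance to the document
    word_sim = unit(cand_embs) @ unit(cand_embs).T   # pairwise candidate similarity
    pool = np.argsort(doc_sim)[-nr_candidates:]      # keep the most relevant words
    best, best_cost = None, float("inf")
    for combo in itertools.combinations(pool, top_n):
        # Total pairwise similarity within the combination; lower = more diverse.
        cost = sum(word_sim[i, j] for i, j in itertools.combinations(combo, 2))
        if cost < best_cost:
            best, best_cost = combo, cost
    return [candidates[i] for i in best]

# Toy 2-D vectors standing in for BERT embeddings.
doc = np.array([1.0, 0.0])
words = ["alpha", "beta", "gamma", "delta"]
embs = np.array([[1.0, 0.0], [0.9, 0.4], [0.8, 0.6], [0.0, 1.0]])
picked = max_sum_distance(doc, embs, words, top_n=2, nr_candidates=3)
```

Note the exhaustive search over combinations: this is why nr_candidates should stay small relative to the candidate vocabulary.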

4. Maximal Marginal Relevance

To diversify the results, Maximal Marginal Relevance (MMR) can be used to create keywords/keyphrases, also based on cosine similarity. Results with high diversity:

>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
 ('labels unseen instances', 0.1649),
 ('new examples optimal', 0.4185),
 ('determine class labels', 0.4774),
 ('supervised learning algorithm', 0.7502)]

Results with low diversity:

>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_mmr=True, diversity=0.2)
[('algorithm generalize training', 0.7727),
 ('supervised learning algorithm', 0.7502),
 ('learning machine learning', 0.7577),
 ('learning algorithm analyzes', 0.7587),
 ('learning algorithm generalize', 0.7514)]
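The greedy MMR selection behind use_mmr can be sketched as follows. Again this is an illustrative re-implementation with toy vectors, not KeyBERT's internal code:

```python
import numpy as np

def mmr(doc_emb, cand_embs, candidates, top_n, diversity):
    """Greedy Maximal Marginal Relevance: trade off relevance to the document
    against similarity to the keywords already selected."""
    unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    doc_sim = unit(cand_embs) @ unit(doc_emb)        # relevance of each candidate
    word_sim = unit(cand_embs) @ unit(cand_embs).T   # redundancy between candidates
    selected = [int(np.argmax(doc_sim))]             # start from the most relevant
    while len(selected) < top_n:
        remaining = [i for i in range(len(candidates)) if i not in selected]
        # Higher diversity weights the penalty for resembling earlier picks.
        scores = [
            (1 - diversity) * doc_sim[i] - diversity * word_sim[i, selected].max()
            for i in remaining
        ]
        selected.append(remaining[int(np.argmax(scores))])
    return [candidates[i] for i in selected]

# Toy 2-D vectors standing in for BERT embeddings.
doc = np.array([1.0, 0.0])
words = ["alpha", "beta", "gamma"]
embs = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
diverse = mmr(doc, embs, words, top_n=2, diversity=0.7)  # favors dissimilar picks
similar = mmr(doc, embs, words, top_n=2, diversity=0.0)  # pure relevance ranking
```

With diversity=0.7 the second pick avoids the near-duplicate "beta"; with diversity=0.0 MMR degenerates into ranking by document similarity alone, matching the low-diversity output above.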

5. Embedding Models

KeyBERT supports many embedding models that can be used to embed documents and words:

  • Sentence-Transformers
  • Flair
  • Spacy
  • Gensim
  • USE

You can browse sentence-transformers models on the Hugging Face Hub and pass one to KeyBERT via the model parameter:

from keybert import KeyBERT
kw_model = KeyBERT(model='all-MiniLM-L6-v2')

Or select a SentenceTransformer model with your own parameters:

from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)

Source: https://github.com/MaartenGr/KeyBERT
