KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.
1. Installation
Installation can be done with pypi:
pip install keybert
2. Usage
from keybert import KeyBERT
doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
Use keyphrase_ngram_range to set the length of the resulting keywords/keyphrases:
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
('algorithm', 0.4556),
('training', 0.4487),
('class', 0.4086),
('mapping', 0.3700)]
To extract keyphrases, simply set keyphrase_ngram_range to (1, 2) or higher, depending on the number of words you want in the resulting keyphrases:
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
('machine learning', 0.6305),
('supervised learning', 0.5985),
('algorithm analyzes', 0.5860),
('learning function', 0.5850)]
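As a rough illustration of what keyphrase_ngram_range controls, candidate n-grams for a range of (1, 2) can be generated like this (a simplified pure-Python stand-in; KeyBERT itself builds candidates with scikit-learn's CountVectorizer, which also handles tokenization and stop words):

```python
def candidate_ngrams(text, ngram_range):
    """Generate word n-grams for every n in the (min_n, max_n) range."""
    words = text.lower().split()
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

# With (1, 2), both single words and two-word phrases become candidates:
# candidate_ngrams("supervised learning task", (1, 2))
# -> ['supervised', 'learning', 'task', 'supervised learning', 'learning task']
```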
Highlight the keywords in the document by simply setting highlight=True:
keywords = kw_model.extract_keywords(doc, highlight=True)
3. Max Sum Distance
To diversify the results, we take the 2 × top_n words/phrases most similar to the document. Then, from those 2 × top_n words, we take all top_n combinations and extract the combination whose members are least similar to each other by cosine similarity.
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_maxsum=True, nr_candidates=20, top_n=5)
[('set training examples', 0.7504),
('generalize training data', 0.7727),
('requires learning algorithm', 0.5050),
('supervised learning algorithm', 0.3779),
('learning machine learning', 0.2891)]
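The selection step described above can be sketched as a brute-force search (illustrative only, not KeyBERT's internal code; cosine similarities are computed with NumPy):

```python
from itertools import combinations

import numpy as np


def max_sum_similarity(doc_emb, cand_embs, candidates, top_n, nr_candidates):
    """Pick top_n candidates that are similar to the document but
    mutually dissimilar (the Max Sum Distance idea)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Similarity of each candidate to the document.
    doc_sims = [cos(doc_emb, c) for c in cand_embs]
    # Keep only the nr_candidates most document-similar candidates.
    top_idx = sorted(range(len(candidates)), key=lambda i: doc_sims[i],
                     reverse=True)[:nr_candidates]
    # Among those, find the top_n combination with the lowest
    # sum of pairwise similarities.
    best_combo, best_score = None, float("inf")
    for combo in combinations(top_idx, top_n):
        score = sum(cos(cand_embs[i], cand_embs[j])
                    for i, j in combinations(combo, 2))
        if score < best_score:
            best_combo, best_score = combo, score
    return [candidates[i] for i in best_combo]
```

Note that the number of combinations grows quickly with nr_candidates, which is why KeyBERT exposes it as a tunable parameter.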
4. Maximal Marginal Relevance
To diversify the results, Maximal Marginal Relevance (MMR) can be used to create keywords/keyphrases, which is also based on cosine similarity. Results with high diversity:
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
('labels unseen instances', 0.1649),
('new examples optimal', 0.4185),
('determine class labels', 0.4774),
('supervised learning algorithm', 0.7502)]
Results with low diversity:
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.2)
[('algorithm generalize training', 0.7727),
('supervised learning algorithm', 0.7502),
('learning machine learning', 0.7577),
('learning algorithm analyzes', 0.7587),
('learning algorithm generalize', 0.7514)]
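The MMR trade-off can be sketched like this (illustrative only; the diversity argument plays the same role as in extract_keywords above, trading document relevance against redundancy among already-selected candidates):

```python
import numpy as np


def mmr(doc_emb, cand_embs, candidates, top_n, diversity):
    """Greedy Maximal Marginal Relevance selection over candidate embeddings."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    doc_sims = [cos(doc_emb, c) for c in cand_embs]
    # Start with the candidate most similar to the document.
    selected = [int(np.argmax(doc_sims))]
    remaining = [i for i in range(len(candidates)) if i not in selected]
    while len(selected) < top_n and remaining:
        # Balance relevance to the document against similarity
        # to what has already been picked.
        scores = {
            i: (1 - diversity) * doc_sims[i]
               - diversity * max(cos(cand_embs[i], cand_embs[j]) for j in selected)
            for i in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]
```

With diversity near 0 the selection is driven almost entirely by document similarity; raising it pushes the picks apart, matching the two outputs shown above.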
5. Embedding Models
KeyBERT supports many embedding models that can be used to embed documents and words:
- Sentence-Transformers
- Flair
- Spacy
- Gensim
- USE
Sentence-transformers models can be browsed on the Hugging Face hub and passed to KeyBERT via the model argument:
from keybert import KeyBERT
kw_model = KeyBERT(model='all-MiniLM-L6-v2')
Or select a SentenceTransformer model with your own parameters:
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
Source: https://github.com/MaartenGr/KeyBERT