Python keyword extraction tools

Latest recommended article published on 2024-08-08 07:55:31

风吹半夏灬

Category: Algorithms. Tags: python, algorithms, natural language processing

Article link: https://blog.csdn.net/zhichaoxia/article/details/109511054

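Preface

For a project, I compared, tested, and surveyed several keyword extraction tools. This post records how to call each one and the final evaluation results.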

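1. jieba

GitHub: https://github.com/fxsjy/jieba/
Install: pip install jieba -i https://pypi.douban.com/simple/

Keyword extraction based on the TextRank algorithm

import jieba.analyse  # import the analyse submodule explicitly

def keyword_extraction(content):
    """Extract keywords with TextRank."""
    # Keep the 50 highest-ranked nouns (n), verbs (v), and verbal nouns (vn).
    keywords = jieba.analyse.textrank(content, topK=50, allowPOS=('n', 'v', 'vn'))
    return keywords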
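
To make the idea behind "TextRank-based" extraction more concrete, here is a minimal pure-Python sketch. It is not jieba's implementation: the function name, the window size, the damping factor, and the iteration count are all assumptions chosen for illustration, and the input is assumed to be already segmented and filtered.

```python
from collections import defaultdict


def textrank_keywords(words, top_k=5, window=5, damping=0.85, iterations=50):
    """Rank words with a simplified TextRank over a co-occurrence graph.

    `words` is an already-segmented, already-filtered token list
    (an assumption of this sketch; real tools segment the text first).
    """
    # Build an undirected weighted graph: two different words are linked
    # when one appears fewer than `window` tokens after the other.
    weights = defaultdict(lambda: defaultdict(float))
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:i + window]:
            if w1 != w2:
                weights[w1][w2] += 1.0
                weights[w2][w1] += 1.0

    # Start every node with the same score, then apply the PageRank update.
    scores = {w: 1.0 for w in weights}
    for _ in range(iterations):
        new_scores = {}
        for w in weights:
            rank = 0.0
            for neighbor, weight in weights[w].items():
                # Each neighbor shares its score in proportion to edge weight.
                out_sum = sum(weights[neighbor].values())
                rank += weight / out_sum * scores[neighbor]
            new_scores[w] = (1 - damping) + damping * rank
        scores = new_scores

    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    tokens = ["a", "b", "a", "c", "a", "d"]
    print(textrank_keywords(tokens, top_k=2, window=2))
```

Words that co-occur with many other words collect score from all of them, which is why a hub word such as "a" in the toy input ranks first.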

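Keyword extraction based on the TF-IDF algorithm

import jieba.analyse

def keyword_tfidf(content):
    """Extract keywords with TF-IDF."""
    # Keep the 50 highest-weighted nouns (n), verbs (v), and verbal nouns (vn).
    keywords = jieba.analyse.extract_tags(content, topK=50, allowPOS=('n', 'v', 'vn'))
    return keywords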
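
For readers who want to see what TF-IDF scoring amounts to, the following is a rough sketch rather than jieba's code. The function name, the toy IDF table, and the median fallback for words missing from that table are assumptions of this example; a real tool ships an IDF table built from a large corpus.

```python
from collections import Counter


def tfidf_keywords(words, idf, top_k=5):
    """Score tokens by term frequency times inverse document frequency.

    `words` is a segmented token list; `idf` maps a word to its IDF value.
    Words missing from `idf` fall back to the median IDF of the table
    (an assumption of this sketch).
    """
    if not words:
        return []
    median_idf = sorted(idf.values())[len(idf) // 2] if idf else 1.0
    counts = Counter(words)
    total = len(words)
    scores = {
        word: count / total * idf.get(word, median_idf)
        for word, count in counts.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical toy IDF table, for illustration only.
toy_idf = {"的": 0.5, "关键词": 6.0, "抽取": 5.0, "工具": 4.0}
tokens = ["关键词", "抽取", "的", "工具", "的", "关键词"]

if __name__ == "__main__":
    print(tfidf_keywords(tokens, toy_idf, top_k=3))
```

The function word "的" appears as often as "关键词" in the toy input, but its low IDF pushes it out of the top three.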

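Parameters (shared by jieba.analyse.textrank and jieba.analyse.extract_tags):

content: the text to extract keywords from (passed as the first positional argument, which jieba names sentence)
topK: number of highest-weighted keywords to return; default 20
withWeight: whether to also return each keyword's weight; default False
allowPOS: keep only words with the given part-of-speech tags; for extract_tags the default is empty (no filtering), while textrank defaults to ('ns', 'n', 'vn', 'v')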
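
The withWeight option is not used in the functions above, so here is a small sketch of how it might be used. The helper names are made up for this example, and jieba is imported inside the function only so that the formatting helper can be used without it.

```python
def top_keywords_with_weights(content, top_k=10, pos=("n", "v", "vn")):
    """Return (word, weight) pairs from jieba's TF-IDF extractor."""
    # Imported lazily so that format_keywords below works without jieba.
    import jieba.analyse

    return jieba.analyse.extract_tags(
        content, topK=top_k, withWeight=True, allowPOS=pos
    )


def format_keywords(pairs):
    """Render (word, weight) pairs as 'word:0.123' strings."""
    return ["{}:{:.3f}".format(word, weight) for word, weight in pairs]
```

With jieba installed, format_keywords(top_keywords_with_weights(text)) gives a compact view of the keywords and their weights, which helps when comparing how strongly each tool ranks a term.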

2. hanlp

GitHub: https://github.com/hankcs/HanLP
Install: pip install pyhanlp -i https://pypi.douban.com/simple/
Custom dictionary: edit the file at ~\Anaconda3\Lib\site-packages\pyhanlp\static\data\dictionary\custom\CustomDictionary.txt

from pyhanlp import HanLP

def keyword_hanlp(content):
    """Extract keywords with HanLP's TextRank implementation."""
    # Returns a Java list holding up to 50 keywords.
    keywords = HanLP.extractKeyword(content, 50)
    return keywords

3. snownlp

GitHub: https://github.com/isnowfy/snownlp
Install: pip install snownlp -i https://pypi.douban.com/simple/

from snownlp import SnowNLP

def keyword_snownlp(content):
    """Extract keywords with SnowNLP's TextRank implementation."""
    # Ask for at most 50 keywords.
    keywords = SnowNLP(content).keywords(50)
    return keywords

4. jiagu

GitHub: https://github.com/ownthink/Jiagu
Install: pip install -U jiagu -i https://pypi.douban.com/simple/

import jiagu

def keyword_jiagu(content):
    """Extract keywords with Jiagu (a toolkit built on BiLSTM-based models)."""
    # Ask for at most 50 keywords.
    keywords = jiagu.keywords(content, 50)
    return keywords

5. harvestText

GitHub: https://github.com/blmoistawinde/HarvestText
Install: pip install --upgrade harvesttext

from harvesttext import HarvestText

ht = HarvestText()

def keyword_harvestText(content, method="tfidf"):
    """Extract keywords with HarvestText using TF-IDF or TextRank."""
    if method == "tfidf":
        # Delegates to jieba's TF-IDF implementation.
        keywords = ht.extract_keywords(content, 50, method="jieba_tfidf", allowPOS={'n', 'v', 'vn'})
    elif method == "textrank":
        # TextRank implemented on top of networkx.
        keywords = ht.extract_keywords(content, 50, method="textrank", allowPOS={'n', 'v', 'vn'})
    else:
        raise ValueError("method must be 'tfidf' or 'textrank'")
    return keywords

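6. SIFRank_zh

GitHub: https://github.com/sunyilgdx/SIFRank_zh
Install: download the source from GitHub and run it from there; the test script is at ~/SIFRank_zh-master/test/test.py

# Based on the pre-trained ELMo model plus the SIF sentence-embedding model.
# SIF, zh_model, and elmo_layers_weight are set up earlier in the repository's test script.
keyphrases = SIFRank(content, SIF, zh_model, N=50, elmo_layers_weight=elmo_layers_weight)

keyphrases_ = SIFRank_plus(content, SIF, zh_model, N=50, elmo_layers_weight=elmo_layers_weight)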

7. macropodus

GitHub: https://github.com/yongzhuo/Macropodus
Install (under Python 3.6): pip install macropodus -i https://pypi.douban.com/simple/

import macropodus

def keyword_macropodus(content):
    """Extract keywords with Macropodus (a toolkit built on Albert+BiLSTM+CRF)."""
    keywords = macropodus.keyword(content)
    return keywords
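
Since the point of this post is to compare the tools, a small harness along the following lines may help. It is only a sketch: the function names and the choice of Jaccard overlap as the comparison measure are my assumptions, not part of any of the libraries above, and each extractor is assumed to return an iterable of keyword strings.

```python
def jaccard(a, b):
    """Jaccard overlap between two keyword collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def compare_extractors(content, extractors):
    """Run each extractor on the same text and report pairwise overlap.

    `extractors` maps a label to a function that takes the text and
    returns an iterable of keywords.
    """
    # str() normalizes keywords that come back as non-Python string types.
    results = {
        name: [str(k) for k in fn(content)] for name, fn in extractors.items()
    }
    names = sorted(results)
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlaps[(first, second)] = jaccard(results[first], results[second])
    return results, overlaps


if __name__ == "__main__":
    # Stand-in extractors so the example runs without any NLP library.
    demo = {
        "first": lambda text: ["x", "y", "z"],
        "second": lambda text: ["y", "z", "w"],
    }
    print(compare_extractors("some text", demo)[1])
```

In practice the stand-ins would be replaced by the wrappers defined in this post, for example {"jieba_textrank": keyword_extraction, "snownlp": keyword_snownlp}. A low overlap does not say which tool is better, only that the outputs deserve a manual look.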

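Conclusion

jiagu and hanlp return keywords that include letters, digits, and single characters, and neither can directly filter out person names or organization names by part of speech.
harvestText brings in networkx's graph and complex-network analysis algorithms; in my tests its results were on par with jieba's TextRank.
SIFRank aggregates keywords (it merges several keywords that sit close together into one). In testing, this merging produced phrases that did not read naturally and were unusable.
macropodus takes no extra parameters, so its extraction behavior is fixed (changing it means editing the source), and for some articles it returned no keywords at all, which is odd.
After trying them all, jieba still came out on top. (This may depend on the text I was processing, so it is worth running your own comparison.)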
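
One possible workaround for the letters, digits, and single characters mentioned above is a post-filter applied to any tool's output. This is a sketch under my own assumptions: the function name, the regular expression, and the minimum length of 2 are illustrative choices, and it does nothing about person or organization names.

```python
import re

# Tokens made only of ASCII letters, digits, or joiners such as "_", ".", "-".
_ASCII_TOKEN = re.compile(r"^[A-Za-z0-9_.\-]+$")


def clean_keywords(keywords, min_length=2):
    """Drop ASCII-only tokens and tokens shorter than `min_length` characters."""
    cleaned = []
    for keyword in keywords:
        word = str(keyword).strip()
        if len(word) < min_length:
            continue  # single characters rarely make useful keywords
        if _ASCII_TOKEN.match(word):
            continue  # pure letters/digits such as "abc", "2020", "v1.0"
        cleaned.append(word)
    return cleaned


if __name__ == "__main__":
    print(clean_keywords(["模型", "的", "abc", "2020", "关键词", "v1.0"]))
```

Whether dropping every ASCII token is acceptable depends on the corpus; texts full of product names or abbreviations would need a looser rule.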
