Keyword Extraction in Python

W&J

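I came across a very good paper on keyword extraction, 《融合LDA与TextRank算法的主题信息抽取方法》 ("A Topic Information Extraction Method Combining LDA and the TextRank Algorithm"). It describes the development of LDA and TextRank in great detail. If, like me, you often get a headache when faced with page after page of formulas, try reading a few related master's theses. Most theses pull together the material on a given topic thoroughly and add the author's own understanding, which can often give you an intuitive feel for the subject.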

----------------------- Separator -----------------------------------------

This article only covers how to implement keyword extraction in Python.

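There are several keyword extraction methods: 1. TextRank 2. TF-IDF 3. LDA. TextRank and TF-IDF both have ready-made functions in jieba that are simple and convenient to call. Other commonly used natural language processing libraries include nltk and gensim, and sklearn also provides ready-made functions for SVD decomposition, LDA, and so on. LDA has also been packaged as a standalone library, which can be installed directly with pip install lda.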
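To make the TF-IDF idea concrete, here is a minimal pure-Python sketch. It is only an illustration: the function name `tfidf_keywords` and the toy English documents are made up for this example, and the IDF is computed from the three toy documents themselves, whereas jieba relies on a precomputed IDF table.

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the tokens of docs[doc_index] by TF-IDF (toy version)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears
    df = Counter(token for doc in docs for token in set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    scores = {
        token: (count / total) * math.log(n_docs / df[token])
        for token, count in tf.items()
    }
    # Sort by descending score; break ties alphabetically for determinism
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]

# Hypothetical pre-tokenized documents
docs = [
    ["weather", "rain", "forecast", "rain"],
    ["weather", "snow", "road"],
    ["actor", "thesis", "weather"],
]
print(tfidf_keywords(docs, 0, top_k=2))
```

In this toy corpus "weather" appears in every document, so its IDF is zero and it drops out, while "rain" ranks first because it is both frequent in the first document and absent from the others.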

jieba

First, let's see how jieba's TextRank and TF-IDF functions extract keywords.

import jieba.analyse
# Prepare the corpus
corpus = "《知否知否应是绿肥红瘦》是由东阳正午阳光影视有限公司出品，侯鸿亮担任制片人，张开宙执导，曾璐、吴桐编剧，赵丽颖、冯绍峰领衔主演，朱一龙、施诗、张佳宁、曹翠芬、刘钧、刘琳、高露、王仁君、李依晓、王鹤润、张晓谦、李洪涛主演，王一楠、陈瑾特别出演的古代社会家庭题材电视剧"

# TextRank
keywords_textrank = jieba.analyse.textrank(corpus)
print(keywords_textrank)
# Output: ['有限公司', '出品', '社会', '家庭', '制片人', '担任', '影视', '题材', '电视剧', '知否', '东阳', '出演', '执导']

# TF-IDF
keywords_tfidf = jieba.analyse.extract_tags(corpus)
print(keywords_tfidf)
# Output: ['知否', '领衔主演', '刘钧', '刘琳', '侯鸿亮', '张晓谦', '王一楠', '张佳宁', '李依晓', '冯绍峰', '王鹤润', '施诗', '陈瑾', '赵丽颖', '吴桐', '朱一龙', '曹翠芬', '王仁君', '曾璐', '高露']

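On this example the TF-IDF result looks better. In the keyword extraction tasks I have worked on, TF-IDF and TextRank produce very similar results in most cases, even though the theory behind TextRank is considerably more complex than that of TF-IDF.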
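For a rough feel of what TextRank does, the sketch below builds a co-occurrence graph over a token list and runs a PageRank-style iteration on it. This is a simplified, unweighted illustration under assumptions of my own (the function name, the window size, the damping factor of 0.85, and the toy tokens); it is not jieba's implementation.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=3, damping=0.85, iterations=50, top_k=3):
    """Toy TextRank: rank tokens via PageRank on a co-occurrence graph."""
    # Build an undirected graph: tokens less than `window` positions apart are linked
    neighbors = defaultdict(set)
    for i, token in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != token:
                neighbors[token].add(other)
                neighbors[other].add(token)

    # Repeatedly apply the PageRank update, starting from uniform scores
    scores = {token: 1.0 for token in neighbors}
    for _ in range(iterations):
        scores = {
            token: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[token])
            for token in neighbors
        }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]

# Hypothetical pre-tokenized text
tokens = ["graph", "rank", "word", "graph", "score", "word", "graph", "link"]
print(textrank_keywords(tokens, top_k=2))
```

Tokens that co-occur with many different neighbors ("graph" and "word" here) accumulate the highest scores, which is the intuition behind using graph centrality as a keyword signal.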

Next, let's look at the other parameters of textrank and extract_tags.

def textrank(self, sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False):
        """
        Extract keywords from sentence using TextRank algorithm.
        Parameter:
            - topK: return how many top keywords. `None` for all possible words.
            - withWeight: if True, return a list of (word, weight);
                          if False, return a list of words.
            - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v'].
                        if the POS of w is not in this list, it will be filtered.
            - withFlag: if True, return a list of pair(word, weight) like posseg.cut
                        if False, return a list of words
        """

def extract_tags(self, sentence, topK=20, withWeight=False, allowPOS=(), withFlag=False):
        """
        Extract keywords from sentence using TF-IDF algorithm.
        Parameter:
            - topK: return how many top keywords. `None` for all possible words.
            - withWeight: if True, return a list of (word, weight);
                          if False, return a list of words.
            - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v','nr'].
                        if the POS of w is not in this list,it will be filtered.
            - withFlag: only work with allowPOS is not empty.
                        if True, return a list of pair(word, weight) like posseg.cut
                        if False, return a list of words
        """

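sentence: the text from which keywords are extracted.

topK: how many keywords to return. Defaults to 20.

withWeight: if True, returns a list of (word, weight) tuples; if False, returns only the words. Defaults to False.

allowPOS: which parts of speech may be returned as keywords. textrank defaults to ('ns', 'n', 'vn', 'v'); extract_tags defaults to (), which applies no part-of-speech filter.

withFlag: if True, results are returned as (word, pos) pairs, where pos is the part-of-speech tag; if False, only the words are returned. Defaults to False. For extract_tags it only takes effect when allowPOS is non-empty.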

gensim

LDA

https://blog.csdn.net/Yellow_python/article/details/83097994

Cleaning the corpus
Before the corpus is fed into gensim to build an LDA topic model, it must be segmented into words.
import jieba
import jieba.posseg as pseg

corpus = [
    '北京市气象台今晨发布1时至9时降水量',
    '降雪期间地面湿滑，能见度下降',
    '演员翟天临的博士学位及相关论文被网友质疑',
    '质疑翟天临“学术造假问题”']

# Set the parts of speech to keep and the stopwords to drop
jieba.add_word("博士",9,'n')
flags = ('n', 'nr', 'ns')
stopwords = ('时','是','的','被')

# Word segmentation
words = []
for sentence in corpus:
    word = [w.word for w in pseg.cut(sentence) if w.flag in flags and w.word not in stopwords]
    words.append(word)

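Building the dictionary
The Dictionary class in gensim's corpora module wraps the methods needed to build a dictionary, map tokens to ids, map ids back to tokens, and convert documents to the bag-of-words model.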
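from gensim.corpora import Dictionary
# Build the dictionary (by default it is pruned at 2,000,000 tokens)
dct = Dictionary(words)
# Add more documents later
dct.add_documents([['翟天临', '博士后', '北京大学']])

# Token-to-id mapping
print(dct.token2id)
# Output: {'网友': 8, '地面': 3, '问题': 11, '降水量': 2, '学术': 10, '气象台': 1, ……}

# Convert each document to a bag of words: a list of (token id, count) pairs
corpus_bow = [dct.doc2bow(doc) for doc in words]
print(corpus_bow)
# Output: [[(0, 1), (1, 1), (2, 1)], [(3, 1), (4, 1), (5, 1)], ……]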
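To show what the token-to-id mapping and the bag-of-words conversion amount to, here is a pure-Python sketch of the same idea. The helper names `build_token2id` and `doc2bow` and the toy documents are hypothetical, and ids are assigned in order of first appearance, which may differ from the ids gensim assigns.

```python
from collections import Counter

def build_token2id(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(doc, token2id):
    """Return sorted (token id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical pre-tokenized documents
docs = [["rain", "weather", "rain"], ["snow", "weather"]]
token2id = build_token2id(docs)
print(token2id)
# → {'rain': 0, 'weather': 1, 'snow': 2}
print([doc2bow(doc, token2id) for doc in docs])
# → [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```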
The LDA model
Finally, the exciting moment has arrived.
from gensim import models
# num_topics sets the number of topics
lda = models.ldamodel.LdaModel(corpus=corpus_bow, id2word=dct, num_topics=2)

sklearn

lda

https://blog.csdn.net/doufreedom1992/article/details/75257774

One last remark: the official documentation and the source code are always the best learning materials.
