Introduction to the TF-IDF Method
TF-IDF is the product of two parts: TF and IDF. Each of the two terms is explained below.
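TF: term frequency. Put simply, it is how often a word appears in a document. The calculation is straightforward:
tf_{i,j} = n_{i,j} / \sum_k n_{i,k}
That is, the term frequency of word j in document i equals n_{i,j}, the number of times word j occurs in document i, divided by the total number of words in document i.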
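The term-frequency definition above can be sketched in a few lines of Python. This is only an illustration: the function name `term_frequency` and the whitespace tokenization are assumptions made for the example, not part of any library.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {word: occurrences / total tokens} for one tokenized document."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: n / total for word, n in counts.items()}


# Hypothetical toy document, split on whitespace for simplicity.
doc = "the cat sat on the mat".split()
tf = term_frequency(doc)
print(tf["the"])  # "the" accounts for 2 of the 6 tokens
```

Real text needs a proper tokenizer (jieba for Chinese, for instance), but the ratio being computed is the same.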
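IDF: inverse document frequency. Start with the document frequency DF, the fraction of all documents in which a word appears. For example, if there are 100 documents and the word appears in 10 of them, its DF is 0.1. The IDF is built on the reciprocal of this DF, which is 10 here. In practice, 1 is added to the denominator to prevent division by zero, and then the logarithm is taken. Inverse document frequency prevents common words from dominating the frequency ranking, which would otherwise leave the extracted keywords full of meaningless common words (prepositions, for example).
idf_j = \log( N / (1 + df_j) )
That is, the inverse document frequency of word j equals the total number of documents N divided by 1 plus the number of documents containing word j (df_j), with the logarithm then taken.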
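As a rough sketch of this definition, the snippet below computes the smoothed IDF for the 100-document example from the text. The function name `inverse_document_frequency` is made up for the illustration, and the natural logarithm is an assumption, since the text does not say which base is used.

```python
import math


def inverse_document_frequency(word, documents):
    """Return log(N / (1 + number of documents containing word)).

    `documents` is a list of token lists; the natural log is an assumption.
    """
    n_containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / (1 + n_containing))


# The example from the text: 100 documents, 10 of which contain the word.
documents = [["keyword"]] * 10 + [["other"]] * 90
print(inverse_document_frequency("keyword", documents))  # log(100 / 11)
```

One consequence of adding 1 to the denominator is that a word appearing in every document gets a slightly negative IDF, because N / (N + 1) is less than 1.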
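The final TF-IDF algorithm multiplies the term frequency by the inverse document frequency. This offsets the advantage of common words, so the highest-scoring words can be extracted as the document's keywords.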
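To make the multiplication concrete, here is a minimal from-scratch sketch that scores every word of a tiny corpus. The function name `tf_idf`, the toy corpus, and the natural logarithm are all assumptions made for the illustration. It is not how jieba computes its weights.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: tf * idf} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (n / total) * math.log(n_docs / (1 + doc_freq[word]))
            for word, n in counts.items()
        })
    return weights


# Hypothetical toy corpus, tokenized on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang in the tree".split(),
    "the fish swam in the pond".split(),
]
scores = tf_idf(corpus)
top = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)
```

In the first document, "the" has the highest term frequency but appears in all four documents, so its weight falls below that of words such as "sat" or "mat" that occur only there.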
Implementation
import jieba.analyse

# Sample text whose keywords will be extracted.
string = "Automatic keyword extraction is to extract topical and important words or phrases from document or document set. It is a basic and necessary work in text mining tasks such as text retrieval and text summarization. This paper discusses the connotation of keyword extraction and automatic keyword extraction. In the light of linguistics, cognitive science, complexity science, psychology and social science, this paper studies the theoretical basis of automatic keyword extraction. From macro, meso and micro perspectives, the development, techniques and methods of automatic keyword extraction are reviewed and analyzed. This paper summarizes the current key technologies and research progress of automatic keyword extraction methods, including statistical methods, topic based methods, and network based methods. The evaluation approach of automatic keyword extraction is analyzed, and the challenges and trends of automatic keyword extraction are also predicted."

# withWeight=True makes extract_tags return (keyword, weight) pairs
# instead of bare keywords.
for keyword, weight in jieba.analyse.extract_tags(string, withWeight=True):
    print('%s %s' % (keyword, weight))
Keyword Extraction Based on Complex Networks
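The complex-network approach relies mainly on the co-occurrence of words in the text. Suppose word A and word B appear together in a sentence: a new edge with weight 1 is then created between them in the network (if the edge already exists, its weight is simply incremented by 1&#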