Problem-Oriented Open-Source Algorithm Management and Recommendation (Final)

 2021SC@SDUSC


Series Index

(1) Division of work within the group

(2) Task 1: crawler code analysis (part 1)

(3) Task 1: crawler code analysis (part 2)

(4) Task 1: dataset statistics code analysis

(5) Task 2 and an overview of the PKE models

(6) Using the PKE models (1)

(7) Using the PKE models (2)

(8) PKE code analysis (1)

(9) PKE code analysis (2)

(10) PKE code analysis (3)

(11) PKE code analysis (4)

(12) PKE code analysis (5)

(13) PKE code analysis (6)

(14) PKE code analysis (7)

(15) PKE code analysis (8)

(16) PKE code analysis (9)

(17, final) Project summary

Table of Contents

Series Index

Preface

I. Model Overview

(1) A brief introduction to keyphrase extraction

(2) Overview of methods in the field

Unsupervised methods

Supervised methods

II. Project Results

(1) multipartiterank

(2) positionrank

(3) embedrank

(4) tfidf

(5) yake

Summary


Preface

The pke package provides a range of keyphrase extraction models. Previous articles in this series analyzed most of them, both unsupervised and supervised. Below, I first give a brief introduction to some of the models (mainly those not covered before), and then analyze the results of the project as a whole.

I. Model Overview

(1) A brief introduction to keyphrase extraction

Keyphrase extraction refers to extracting typical, representative phrases from a document, with the goal of capturing its key content.

Keyphrase extraction matters for document understanding, search, classification, and clustering, and a high-quality keyphrase extraction algorithm can also effectively support the construction of knowledge graphs.

Common keyphrase extraction methods fall into supervised and unsupervised approaches. The overall extraction pipeline consists of two steps: (1) candidate generation, which produces a set of candidate phrases, and (2) keyphrase scoring, which scores those candidates (a minimal sketch of this two-step flow in pke follows Figure 1).

Figure 1: Overall keyphrase extraction pipeline
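As a concrete bridge to the experiments in part II, the minimal sketch below shows how the two steps map onto pke's extractor API; the model chosen here (TopicRank) and the file name are placeholders, so treat it as an illustration of the generic flow rather than a recipe.

import pke

# every pke model exposes the same two-step interface
extractor = pke.unsupervised.TopicRank()
extractor.load_document(input='input.txt', language='en')

# step (1): candidate generation
extractor.candidate_selection()

# step (2): keyphrase scoring
extractor.candidate_weighting()

# keep the highest-scored candidates as keyphrases
print(extractor.get_n_best(n=5))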

(2) Overview of methods in the field

Unsupervised methods

Unsupervised methods are widely used because they require no labeled data and generalize across domains.

Figure 2: Overview of unsupervised methods

1. Statistics-based methods

  • TF-IDF-based scoring is the most basic approach: starting from a set of candidate phrases (e.g., noun phrases (NP) extracted with POS tags), each candidate is scored with term frequency and inverse document frequency, and the highest-scoring phrases are kept as keyphrases (a minimal scoring sketch follows this list).
  • YAKE[1] uses term frequency and term position together with a number of additional statistical features, aiming to better capture each phrase's context and the role it plays in the document.
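To make the TF-IDF scoring step concrete, here is a minimal sketch in plain Python; the candidate counts and the toy document-frequency table are made up for illustration, and a real system (including pke's TfIdf model used later) computes idf from a background corpus.

import math

# toy inputs: candidate term frequencies in one document, and document
# frequencies from a (hypothetical) background corpus of N documents
candidates = {'fair use': 3, 'copyright': 4, 'public interest': 1}
doc_freq   = {'fair use': 20, 'copyright': 150, 'public interest': 35}
N = 1000

def tfidf(phrase):
    tf = candidates[phrase]
    idf = math.log(N / (1 + doc_freq.get(phrase, 0)))
    return tf * idf

# rank candidates by their tf x idf score
ranked = sorted(candidates, key=tfidf, reverse=True)
print([(p, round(tfidf(p), 3)) for p in ranked])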

2. Graph-based methods

  • TextRank[2] was the first graph-based keyphrase extraction algorithm. It first extracts candidate phrases using POS tags, then builds a graph with the candidates as nodes. Two candidates that co-occur within a fixed window are connected by an edge. The graph is then updated with the PageRank[3] algorithm until convergence (a small graph-construction sketch follows this list).
  • Since then, many improved graph-based algorithms have been proposed, and this family has become the most widely used approach to unsupervised keyphrase extraction. SingleRank[4] adds weights to the edges on top of TextRank. PositionRank[5] incorporates phrase position information into a biased weighted PageRank, yielding more accurate keyphrase extraction.
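The sketch below illustrates TextRank-style graph construction on a toy token sequence using networkx; the window size and tokens are arbitrary, and real implementations (such as pke's TextRank/SingleRank) add POS filtering, edge weights, and phrase reconstruction on top of this.

import networkx as nx

# toy sequence of candidate words (already POS-filtered in a real system)
words = ['copyright', 'fair', 'use', 'system', 'copyright', 'law', 'fair', 'use']
window = 3

G = nx.Graph()
# connect words that co-occur within the window
for i, w in enumerate(words):
    for j in range(i + 1, min(i + window, len(words))):
        if w != words[j]:
            G.add_edge(w, words[j])

# run PageRank until convergence and rank the nodes
scores = nx.pagerank(G, alpha=0.85)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))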

3. Embedding-based methods: these methods use embeddings to represent the document and its phrases at different levels of information (characters, syntax, semantics, and so on).

  • EmbedRank[6] first extracts candidate phrases using POS tags, then computes the cosine similarity between each candidate's embedding and the document embedding, and ranks the candidates by that similarity to obtain the keyphrases (see the sketch below).
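The core ranking step of EmbedRank is just a cosine similarity between phrase and document vectors. The sketch below assumes a hypothetical embed() callable that maps text to a vector (in the paper this is sent2vec or doc2vec), so it only illustrates the ranking, not the embedding model itself.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_candidates(document, candidates, embed):
    """embed: hypothetical callable text -> np.ndarray (e.g. a sent2vec model)."""
    doc_vec = embed(document)
    scored = [(c, cosine(embed(c), doc_vec)) for c in candidates]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)

# usage with a dummy, deterministic embedding just to make the sketch runnable
dummy_embed = lambda text: np.random.default_rng(abs(hash(text)) % (2**32)).random(16)
print(rank_candidates("the document text", ["fair use", "copyright"], dummy_embed))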

Figure 3: Performance of typical unsupervised methods on benchmarks

Supervised methods

Although they require substantial labeling effort, supervised methods usually achieve better results on specific tasks and datasets.

Figure 4: Overview of supervised methods

1. Traditional methods

  • KEA[7] is one of the earlier algorithms. It represents each candidate phrase with a feature vector, e.g., its tf-idf score and the position of its first occurrence in the document, and uses a Naïve Bayes classifier to score and classify the candidates (a feature-based sketch follows this list). Many improved variants have since been proposed; for example, Hulth et al. introduced linguistic knowledge in an improved version[8]. CeKE[9], which extracts keyphrases from academic papers, further improves results by using citation relations to bring in additional features.
  • RankingSVM[10] models the problem with learning to rank, casting training as fitting a ranking function.
  • TopicCoRank[11] is a supervised extension of the unsupervised TopicRank method. It couples a second graph with the basic topic graph.
  • CRF[12] is a classic sequence labeling algorithm. It represents the document with features drawn from linguistics, document structure, and other sources, and obtains the keyphrases via sequence labeling.
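As a rough illustration of the KEA-style feature approach, the sketch below scores candidates using two of the features mentioned above (tf-idf score and relative position of first occurrence) with a Naïve Bayes classifier from scikit-learn; the tiny hand-made training set is purely hypothetical.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# each candidate is described by [tf-idf score, relative first-occurrence position]
X_train = np.array([[0.90, 0.05],   # high tf-idf, appears early -> keyphrase
                    [0.70, 0.10],
                    [0.20, 0.80],   # low tf-idf, appears late    -> not a keyphrase
                    [0.10, 0.60]])
y_train = np.array([1, 1, 0, 0])

clf = GaussianNB().fit(X_train, y_train)

# score unseen candidates by the predicted probability of being a keyphrase
X_new = np.array([[0.80, 0.07], [0.15, 0.70]])
print(clf.predict_proba(X_new)[:, 1])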

2. Deep-learning-based methods

  • RNN[13] uses a two-layer RNN architecture, representing the information with two hidden layers, and produces the final result via sequence labeling.
  • CopyRNN[14] uses an encoder-decoder architecture for keyphrase extraction. The training data is first converted into text-keyphrase pairs, and an RNN-based encoder-decoder network is then trained to learn the mapping from the source data (sentence) to the target data (keyphrase).
  • CorrRNN[15] also uses an encoder-decoder architecture, but introduces two additional constraints:
    ① keyphrases should cover as many of the document's different topics as possible;
    ② keyphrases should differ from each other as much as possible, to ensure diversity.

Figure 5: Performance of typical supervised methods on benchmarks

Reference: 关键短语抽取及使用BERT-CRF的技术实践 - 知乎 (zhihu.com) (keyphrase extraction and BERT-CRF in practice, Zhihu)

II. Project Results

Since pke includes multiple models, several of them were used for extraction here.

(1) multipartiterank

This was the task originally assigned to me at the start of the project.

1. First, test with the default English setting. The program is as follows:

import pke
import string
from nltk.corpus import stopwords

# 1. create a MultipartiteRank extractor.
extractor = pke.unsupervised.MultipartiteRank()

# 2. load the content of the document.
extractor.load_document(input='input.txt',language='en')

# 3. select the longest sequences of nouns and adjectives, that do
#    not contain punctuation marks or stopwords as candidates.
pos = {'NOUN', 'PROPN', 'ADJ'}
stoplist = list(string.punctuation)
stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']
stoplist += stopwords.words('english')
extractor.candidate_selection(pos=pos, stoplist=stoplist)

# 4. build the Multipartite graph and rank candidates using random walk,
#    alpha controls the weight adjustment mechanism, see TopicRank for
#    threshold/method parameters.
extractor.candidate_weighting(alpha=1.1,threshold=0.74,method='average')

# 5. get the n highest-scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=3)
print(keyphrases)

The input file is input.txt; its content is the English translation of the first abstract from the data I crawled:

The copyright "fair use" system is the main content of copyright restrictions in the copyright systems of various countries. The fair use system reflects the dual purpose of the copyright law to protect the interests of authors and other copyright owners and promote the widespread dissemination of knowledge and information. The justification of fair use can be It can be understood from many aspects, including the balance of incentives and proximity, the constitution and public interests, economic analysis based on transaction costs and classical economics, etc. In the network environment, the system of fair use of copyright still has its rationality. This is the manifestation of the regulations on fair use in the Regulations on the Protection of the Right to Dissemination of Information Networks promulgated and implemented by my country.

The keywords given in the data are: Copyright; fair use; legitimacy; public interest.

The results are as follows.

For n = 3:

For n = 5:

[('fair use', 0.10302061204067117), ('copyright', 0.08481200673078822), ('system', 0.058771610698638184), ('interests', 0.04434068816359551), ('widespread dissemination', 0.03938691866267139)]

2. Next, test with our Chinese data.

The code is as follows:

import pke
import string
from nltk.corpus import stopwords

# 1. create a MultipartiteRank extractor.
extractor = pke.unsupervised.MultipartiteRank()

# 2. load the content of the document.
extractor.load_document(input='input_zh.txt',language='zh')

# 3. select the longest sequences of nouns and adjectives, that do
#    not contain punctuation marks or stopwords as candidates.
pos = {'NOUN', 'PROPN', 'ADJ'}
stoplist = list(string.punctuation)
#stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']

extractor.candidate_selection(pos=pos, stoplist=stoplist)

# 4. build the Multipartite graph and rank candidates using random walk,
#    alpha controls the weight adjustment mechanism, see TopicRank for
#    threshold/method parameters.
extractor.candidate_weighting(alpha=1.1,threshold=0.74,method='average')

# 5. get the n highest-scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=5)
print(keyphrases)

The input file is input_zh.txt; its content is the first abstract from the crawled data (the Chinese original of the abstract above):

著作权"合理使用"制度是各国著作权制度中对著作权限制的主要内容.合理使用制度体现了著作权法保护作者和其他著作权人的利益与促进知识与信息广泛传播的双重目的.合理使用的正当性可以从多方面加以认识,包括激励与接近之平衡,宪法与公共利益,以交易成本和古典经济学为基础的经济学分析等.在网络环境下,著作权合理使用制度仍然有其存在的合理性.我国颁布实施的《信息网络传播权保护条例》对合理使用的规定即是这种体现.

The keywords given in the data are: 著作权 (copyright); 合理使用 (fair use); 正当性 (legitimacy); 公共利益 (public interest).

The results are as follows.

For n = 3:

[('使用 制度', 0.12132447967828075), ('各国 著作权 制度', 0.1125765135624001), ('古典 经济学', 0.08092805772072732)]

For n = 5:

[('使用 制度', 0.12132447967828075), ('各国 著作权 制度', 0.1125765135624001), ('古典 经济学', 0.08092805772072732), ('著作权', 0.07701539229963357), ('主要 内容', 0.067542739114407)]

The results are clearly not good, for the following reason:

WARNING:root:No stopwords for 'zh' language.
WARNING:root:Please provide custom stoplist if willing to use stopwords. Or update nltk's `stopwords` corpora using `nltk.download('stopwords')`
WARNING:root:No stemmer for 'zh' language.
WARNING:root:Stemming will not be applied.

After looking into this, I found that the pke package does not support Chinese: although the spacy package it calls does have a Chinese model, the other processing functions pke implements are not adapted to Chinese.
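The warning above suggests one partial workaround: provide a custom stoplist, since NLTK ships no Chinese stopword list. A minimal sketch along those lines is shown below; the few stopwords listed are only examples, and this addresses the stopword warning but does nothing about the missing Chinese stemmer.

import string
import pke

# a minimal, hand-picked Chinese stoplist (example words only)
zh_stoplist = ['的', '了', '与', '和', '是', '在', '对', '从', '这种', '可以']
zh_stoplist += list(string.punctuation)

extractor = pke.unsupervised.MultipartiteRank()
extractor.load_document(input='input_zh.txt', language='zh')

# pass the custom stoplist so pke no longer falls back to "no stopwords"
extractor.candidate_selection(pos={'NOUN', 'PROPN', 'ADJ'}, stoplist=zh_stoplist)
extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
print(extractor.get_n_best(n=5))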

(2) positionrank

Since the pke package also includes a PositionRank model, I tried it as well.

It was tested with both English and Chinese, using the same inputs as in (1). The code and results are as follows.

English

import pke

# define the valid parts of speech to occur in the graph
pos = {'NOUN', 'PROPN', 'ADJ'}

# define the grammar for selecting the keyphrase candidates
grammar = "NP: {<ADJ>*<NOUN|PROPN>+}"

# 1. create a PositionRank extractor.
extractor = pke.unsupervised.PositionRank()

# 2. load the content of the document.
extractor.load_document(input='input.txt', language='en', normalization=None)

# 3. select the noun phrases up to 3 words as keyphrase candidates.
extractor.candidate_selection(grammar=grammar, maximum_word_number=3)

# 4. weight the candidates using the sum of their words' scores, computed
#    with a random walk biased by the position of the words in the document.
#    In the graph, nodes are words (nouns and adjectives only) that are
#    connected if they occur within a window of 10 words.
extractor.candidate_weighting(window=10, pos=pos)

# 5. get the n highest-scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=5)
print(keyphrases)

The results are as follows.

n=3

[('fair use system', 0.23486112803260442), ('fair use', 0.17254912806676392), ('other copyright owners', 0.17211345865431515)]

n=5

[('fair use system', 0.23486112803260442), ('fair use', 0.17254912806676392), ('other copyright owners', 0.17211345865431515), ('copyright systems', 0.16633739048503726), ('copyright restrictions', 0.16493091422564646)]

Chinese

import pke

# define the valid parts of speech to occur in the graph
pos = {'NOUN', 'PROPN', 'ADJ'}

# define the grammar for selecting the keyphrase candidates
grammar = "NP: {<ADJ>*<NOUN|PROPN>+}"

# 1. create a PositionRank extractor.
extractor = pke.unsupervised.PositionRank()

# 2. load the content of the document.
extractor.load_document(input='input_zh.txt', language='zh', normalization=None)

# 3. select the noun phrases up to 3 words as keyphrase candidates.
extractor.candidate_selection(grammar=grammar, maximum_word_number=3)

# 4. weight the candidates using the sum of their words' scores, computed
#    with a random walk biased by the position of the words in the document.
#    In the graph, nodes are words (nouns and adjectives only) that are
#    connected if they occur within a window of 10 words.
extractor.candidate_weighting(window=10, pos=pos)

# 5. get the n highest-scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=5)
print(keyphrases)

The results are as follows.

n=3

[('各国 著作权 制度', 0.20723655282406506), ('著作 权"', 0.15536170272274746), ('使用 制度', 0.15267089966314606)]

n=5

[('各国 著作权 制度', 0.20723655282406506), ('著作 权"', 0.15536170272274746), ('使用 制度', 0.15267089966314606), ('著作 权人', 0.13646647548621305), ('制度', 0.10120811375481914)]

(3) embedrank

By the same token, since this model is part of the project tasks and is also included in the pke package, I tested it too.

Again, both English and Chinese were tested with the same inputs as in (1). The code and results are as follows.

English

import pke

# 1. create an EmbedRank extractor.
extractor = pke.unsupervised.EmbedRank()

# 2. load the content of the document.
extractor.load_document(input='input.txt', language='en', normalization=None)

# 3. select sequences of nouns and adjectives as candidates.
extractor.candidate_selection()

# 4. weight the candidates using the EmbedRank method.
extractor.candidate_weighting()

# 5. get the n highest-scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=5)
print(keyphrases)

The run failed with an error. After some investigation I thought pip install sent2vec would fix it, but after installing, it still failed with: module 'sent2vec' has no attribute 'Sent2vecModel'.

After searching, I found that this happens because the package on PyPI is a completely different package that happens to share the name.

The correct way to install it is to clone the repository at GitHub - epfml/sent2vec: General purpose unsupervised sentence representations and follow the instructions in its README.
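Assuming the epfml sent2vec has been built and installed from source as its README describes, a quick check like the following should confirm that the Sent2vecModel class the PyPI package lacks is now available; the model path is a placeholder, and the method names are based on my reading of that README, so verify against the repository.

import sent2vec

# with the correct (epfml) package installed, this class exists;
# the unrelated PyPI package of the same name does not provide it
model = sent2vec.Sent2vecModel()

# placeholder path to a pre-trained model downloaded from the repository
model.load_model('sent2vec_wiki_bigrams.bin')
emb = model.embed_sentence('fair use of copyright .')
print(emb.shape)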

The error message then asks for a pre-trained model: Please download "sent2vec_wiki_bigrams" model from https://github.com/epfml/sent2vec#downloading-sent2vec-pre-trained-models. And place it in D:\ProgramData\Anaconda3\envs\py38\lib\site-packages\pke\models.

That model is a 16 GB download hosted on GitHub, which I simply could not download due to network issues, so I gave up on this model.

(4) tfidf

During execution the program raised [Errno 2] No such file or directory: 'path/to/df.tsv.gz'; copying the document-frequency file to the expected location fixed it.
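For reference, pke can also build such a document-frequency file from your own corpus via pke.compute_document_frequency. The sketch below is based on the pke version used in this series; the corpus directory, file extension, and parameter defaults are assumptions to check against your installation.

import string
from pke import compute_document_frequency

# build a df.tsv.gz file from a folder of plain-text documents
# (paths and the extension are placeholders for a real corpus)
compute_document_frequency(
    input_dir='corpus/',          # one document per file
    output_file='df.tsv.gz',      # gzipped tsv read by load_document_frequency_file
    extension='txt',
    language='en',
    normalization='stemming',
    stoplist=list(string.punctuation),
    n=3)                          # count n-grams up to length 3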

English

import string
import pke

# 1. create a TfIdf extractor.
extractor = pke.unsupervised.TfIdf()

# 2. load the content of the document.
extractor.load_document(input='input.txt', language='en', normalization=None)

# 3. select {1-3}-grams not containing punctuation marks as candidates.
extractor.candidate_selection(n=3, stoplist=list(string.punctuation))

# 4. weight the candidates using tf x idf.
df = pke.load_document_frequency_file(input_file='df.tsv.gz')
extractor.candidate_weighting(df=df)

# 5. get the n highest-scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=5)
print(keyphrases)

The results are as follows.

n=3

[('copyright', 9.509775004326938), ('fair', 7.924812503605781), ('fair use', 7.924812503605781)]

n=5

[('copyright', 9.509775004326938), ('fair', 7.924812503605781), ('fair use', 7.924812503605781), ('the copyright', 4.754887502163469), ('of copyright', 3.1699250014423126)]

Chinese

import string
import pke

# 1. create a TfIdf extractor.
extractor = pke.unsupervised.TfIdf()

# 2. load the content of the document.
extractor.load_document(input='input_zh.txt', language='zh', normalization=None)

# 3. select {1-3}-grams not containing punctuation marks as candidates.
extractor.candidate_selection(n=3, stoplist=list(string.punctuation))

# 4. weight the candidates using tf x idf.
df = pke.load_document_frequency_file(input_file='df.tsv.gz')
extractor.candidate_weighting(df=df)

# 5. get the n highest-scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=5)
print(keyphrases)

The results are as follows.

n=3

[('合理 使用', 7.924812503605781), ('著作权', 4.754887502163469), ('合理 使用 制度', 3.1699250014423126)]

n=5

[('合理 使用', 7.924812503605781), ('著作权', 4.754887502163469), ('合理 使用 制度', 3.1699250014423126), ('使用 制度', 3.1699250014423126), ('经济学', 3.1699250014423126)]

(5) yake

English

import pke
from nltk.corpus import stopwords

# 1. create a YAKE extractor.
extractor = pke.unsupervised.YAKE()

# 2. load the content of the document.
extractor.load_document(input='input.txt', language='en', normalization=None)

# 3. select {1-3}-grams not containing punctuation marks and not
#    beginning/ending with a stopword as candidates.
stoplist = stopwords.words('english')
extractor.candidate_selection(n=3, stoplist=stoplist)

# 4. weight the candidates using the YAKE weighting scheme; a window (in
#    words) for computing left/right contexts can be specified.
window = 2
use_stems = False  # use stems instead of words for weighting
extractor.candidate_weighting(window=window, stoplist=stoplist, use_stems=use_stems)

# 5. get the n highest-scored candidates as keyphrases.
#    Redundant keyphrases are removed from the output using Levenshtein
#    distance and a threshold.
threshold = 0.8
keyphrases = extractor.get_n_best(n=5, threshold=threshold)
print(keyphrases)

The results are as follows.

n=3

[('fair use', 0.01135173096383017), ('fair use system', 0.014403067595763973), ('various countries', 0.023628776688252303)]

n=5

[('fair use', 0.01135173096383017), ('fair use system', 0.014403067595763973), ('various countries', 0.023628776688252303), ('main content', 0.025831869116219618), ('use system reflects', 0.026526627786323434)]

Chinese

import pke
from nltk.corpus import stopwords

# 1. create a YAKE extractor.
extractor = pke.unsupervised.YAKE()

# 2. load the content of the document.
extractor.load_document(input='input_zh.txt', language='zh', normalization=None)

# 3. select {1-3}-grams not containing punctuation marks and not
#    beginning/ending with a stopword as candidates.
#    NLTK provides no Chinese stopword list, so the English one is reused here.
stoplist = stopwords.words('english')
extractor.candidate_selection(n=3, stoplist=stoplist)

# 4. weight the candidates using the YAKE weighting scheme; a window (in
#    words) for computing left/right contexts can be specified.
window = 2
use_stems = False  # use stems instead of words for weighting
extractor.candidate_weighting(window=window, stoplist=stoplist, use_stems=use_stems)

# 5. get the n highest-scored candidates as keyphrases.
#    Redundant keyphrases are removed from the output using Levenshtein
#    distance and a threshold.
threshold = 0.8
keyphrases = extractor.get_n_best(n=5, threshold=threshold)
print(keyphrases)

n=3

[('著作权', 0.1507300235092945), ('经济学', 0.2658687988084529), ('合理性', 0.5636270307985051)]

n=5

[('著作权', 0.1507300235092945), ('经济学', 0.2658687988084529), ('合理性', 0.5636270307985051), ('正当性', 0.5654389247550586), ('多方面', 0.5654389247550586)]


Summary

Five models were run above on the same input, so the differences in their extracted keyphrases can be seen directly. Comparing the extracted phrases against the keywords given in the data yields the figures in Table 1.

Table 1: Performance comparison of the different models

| Dataset      | Method           | Top3 PR | Top3 RR | Top3 F1 | Top5 PR | Top5 RR | Top5 F1 |
|--------------|------------------|---------|---------|---------|---------|---------|---------|
| Baiduxueshu  | multipartiterank | 2/3     | 1/2     | 4/7     | 3/5     | 3/4     | 2/3     |
| Baiduxueshu  | PositionRank     | 1/3     | 1/4     | 2/7     | 1/5     | 1/4     | 2/9     |
| Baiduxueshu  | tfidf            | 2/3     | 1/2     | 4/7     | 2/5     | 1/2     | 4/9     |
| Baiduxueshu  | yake             | 1/3     | 1/4     | 2/7     | 1/5     | 1/4     | 2/9     |
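Reading PR and RR in Table 1 as precision and recall (the fractions are consistent with this, given the four reference keywords per abstract), the F1 column is the usual harmonic mean F1 = 2*PR*RR/(PR+RR); for multipartiterank at Top3 this gives 2*(2/3)*(1/2)/((2/3)+(1/2)) = 4/7. The small check below recomputes the F1 column from the PR and RR columns.

from fractions import Fraction as F

def f1(p, r):
    return 2 * p * r / (p + r)

# (Top3 PR, Top3 RR, Top5 PR, Top5 RR) taken from Table 1
rows = {'multipartiterank': (F(2, 3), F(1, 2), F(3, 5), F(3, 4)),
        'PositionRank':     (F(1, 3), F(1, 4), F(1, 5), F(1, 4)),
        'tfidf':            (F(2, 3), F(1, 2), F(2, 5), F(1, 2)),
        'yake':             (F(1, 3), F(1, 4), F(1, 5), F(1, 4))}

for name, (p3, r3, p5, r5) in rows.items():
    print(name, 'Top3 F1 =', f1(p3, r3), 'Top5 F1 =', f1(p5, r5))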

For further analysis of such results, see Python中七种主要关键词提取算法的基准测试 - 51CTO.COM (a benchmark of seven major keyword extraction algorithms in Python).
