Automated Rule Selection for Aspect Extraction in Opinion Mining（2020.10.20）

fuchengguo666

Last modified: 2022-03-03 22:10:00


Category: sentiment analysis. Tags: nlp

First published: 2020-10-20 20:10:38

Article link: https://blog.csdn.net/fuchengguo666/article/details/109179868


Automated Rule Selection for Aspect Extraction in Opinion Mining


Abstract

Aspect extraction aims to extract fine-grained opinion targets from opinion texts.
Recent work has shown that the syntactical approach, which employs rules about grammar dependency relations between opinion words and aspects, performs quite well. This approach is highly desirable in practice because it is unsupervised and domain independent.
However, the rules need to be carefully selected and tuned manually so as not to produce too many errors.
Although it is easy to evaluate the accuracy of each rule automatically, it is not easy to select a set of rules that produces the best overall result, because the coverage of the rules overlaps.
In this paper, we propose a novel method to select an effective set of rules. To our knowledge, this is the first work that selects rules automatically.
Our experimental results show that the proposed method can select a subset of a given rule set that achieves significantly better results than the full rule set and the existing state-of-the-art CRF-based supervised method.

1. Introduction

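Aspect extraction is a fundamental task of opinion mining or sentiment analysis. It aims to extract fine-grained opinion targets from opinion text. For example, in the sentence “This phone has a good screen”, we want to extract “screen”. In product reviews, aspects are basically attributes or features of products. Aspect extraction is important for opinion mining because, without knowing the aspects that the opinions are about, the opinions are of limited use [Liu, 2012].
Aspect extraction has been studied extensively in recent years. There are two main approaches: syntactical and statistical. Existing work has shown that syntactic-dependency-based methods, such as double propagation (DP), perform quite well.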
Note that an aspect term can be a single word or a multi-word phrase (e.g., “battery life”).
Clearly, syntactic patterns can extract wrong terms. For example, in “It is a good idea to get this phone,” the words “good” and “idea” also have the dependency relation amod, but “idea” is not an aspect of the phone.
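To produce good results, the rules must be selected carefully. Besides choosing the right rules, some heuristics were proposed in [Popescu and Etzioni, 2005; Qiu et al., 2011] to prune the extracted aspects and reduce errors. However, simple heuristics (such as the frequency-based pruning in [Hu and Liu, 2004]) prune many correct low-frequency aspects, while sophisticated heuristics (such as those in [Qiu et al., 2011]) need external knowledge for pruning. In existing work, researchers typically inspect a set of rules manually, e.g., [Qiu et al., 2011], and then select a small subset of more reliable ones. However, since they work by trial and error, no reasons are given for why some rules are chosen and others are not, how to judge whether a rule is good, or how to combine a set of rules.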
It is important to have these reasons in order to select rules automatically, because quality is very difficult to guarantee with the manual approach. This paper addresses this difficulty and proposes an automated rule selection algorithm that chooses a subset of good rules from a given set of rules.
Formally, our problem is stated as follows. Given a set of aspect extraction rules R (each with an id), a set of seed opinion words O, and a set of reviews D with labeled aspects, we want to select a subset of rules in R that can be used to extract aspects from reviews across domains. This cross-domain capability is important because we do not want to label data in each domain, which is highly labor-intensive. The selected rules are domain independent because they are syntactic rules, and thus can be used across domains.
The proposed algorithm can take any number of rules of any quality and automatically select a good subset of rules for extraction. Selecting the optimal set of rules would involve evaluating all possible subsets of R on D and selecting the subset, with the right rule sequence, that performs the best. However, as the number of rules $m$ grows, the number of subsets grows exponentially, i.e., $2^m$. Thus, we propose a greedy algorithm, which is inspired by rule learning in machine learning and data mining.
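Our experiments were conducted on a popular aspect extraction evaluation dataset containing annotated reviews of five products [Hu and Liu, 2004], plus some additional datasets that we annotated. The results show that the proposed method can select a subset of the rules used in DP that extracts more accurately than the original DP rule set. Moreover, since our method accepts rules of any quality as input, we added more new rules to the DP rule set. The proposed method then selects a rule subset that yields even better results than selecting from the DP rules alone. It also outperforms the state-of-the-art CRF-based supervised learning method by a large margin.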

2. Related Work

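Topic modeling usually gives only some rough topics in a document corpus rather than precise aspects, because a topic term does not necessarily indicate an aspect [Lin and He, 2009; Lu et al., 2009; Zhao et al., 2010a; Jo and Oh, 2011; Fang and Huang, 2012]. For example, in a battery topic, a topic model may find topic terms such as “battery”, “life”, “day”, and “time” that relate to battery life, but each individual word is not an aspect.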
In this paper, we focus on the syntactical approach. Our work is most related to the DP approach [Qiu et al., 2011], which we will discuss further below. Since our rule selection method uses some training data, it can be regarded as an integration of both supervised and unsupervised learning, because its resulting rule set can be applied to test data from any domain.
Our work is also related to associative classification [Liu et al., 1998], which uses association rule mining algorithms to generate the complete set of association rules and then selects a small set of high-quality rules for classification, and to the pattern mining method in [Kobayashi et al., 2007]. However, we do not need to generate rules because we already have such rules based on syntactical dependency relations. Our rule selection method is also very different.

3. Syntactical Extraction Rules

This work uses dependency relations between opinion words and aspects, as well as among opinion words and among aspects themselves, to extract aspects.
For example, one extraction rule could be “if a word A, whose part-of-speech (POS) is a singular noun (nn), has the dependency relation amod with (i.e., is modified by) an opinion word O, then A is an aspect”, which can be formulated as the following rule r1:

r1: IF depends(amod, A, O) ∧ pos(A, nn) ∧ opinionword(O) THEN aspect(A)

Here depends(amod, A, O) means that A and O have the dependency relation amod, pos(A, nn) means that A is a singular noun, opinionword(O) means that O is an opinion word, and aspect(A) means that A is an aspect. For example, from “This phone has a great screen”, we can extract the aspect “screen” because there is an amod relation between “great” and “screen”, “great” is an opinion word, and “screen” is a noun.
We group rules into three types based on whether they can extract aspects by themselves given a set of opinion words. For some rules, the propagation mechanism in DP is needed before they can be used in the extraction.
- Type 1 rules ($R^1$): use opinion words to extract aspects (based on some dependency relation between them), e.g., rule r1. A set of seed opinion words is given a priori.
- Type 2 rules ($R^2$): use aspects to extract aspects. The known aspects were extracted in previous propagation. For example, if “screen” has been extracted by a previous rule as an aspect, a Type 2 rule based on the conj relation can extract “speaker” as an aspect from “This phone has a great screen and speaker”, because “screen” and “speaker” have the conj dependency relation.
- Type 3 rules ($R^3$): use aspects and opinion words to extract new opinion words. The given aspects were extracted in previous propagation, and the given opinion words are either known seeds or were extracted in previous propagation. In such a rule, pos(O, jj) states that O is an adjective. For example, if “screen” has been extracted as an aspect and “nice” was not a seed opinion word, then “nice” will be extracted as an opinion word by this kind of rule from “This phone has a nice screen.”
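The interplay of the three rule types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Dep` triples stand in for the output of a real dependency parser, the hand-written parses and POS tags are hypothetical, and only one rule per type is modeled (amod for Type 1, conj for Type 2, and, as an assumption suggested by the “nice screen” example, amod again for Type 3). It is not the DP rule set itself.

```python
from typing import NamedTuple


class Dep(NamedTuple):
    rel: str        # dependency relation, e.g. "amod" or "conj"
    head: str       # governing word
    dependent: str  # dependent word


def propagate(deps, pos, seed_opinions):
    """Apply one toy rule of each type repeatedly until nothing new is found."""
    opinions, aspects = set(seed_opinions), set()
    changed = True
    while changed:
        changed = False
        for d in deps:
            # Type 1: a noun modified (amod) by a known opinion word is an aspect.
            if (d.rel == "amod" and pos.get(d.head) == "nn"
                    and d.dependent in opinions and d.head not in aspects):
                aspects.add(d.head)
                changed = True
            # Type 2: a noun conjoined (conj) with a known aspect is an aspect.
            if (d.rel == "conj" and d.head in aspects
                    and pos.get(d.dependent) == "nn" and d.dependent not in aspects):
                aspects.add(d.dependent)
                changed = True
            # Type 3 (assumed form): an adjective modifying a known aspect
            # becomes a new opinion word.
            if (d.rel == "amod" and d.head in aspects
                    and pos.get(d.dependent) == "jj" and d.dependent not in opinions):
                opinions.add(d.dependent)
                changed = True
    return aspects, opinions


# Hand-written parses of "This phone has a great screen and speaker"
# and "This phone has a nice screen" (hypothetical parser output).
deps = [
    Dep("amod", "screen", "great"),
    Dep("conj", "screen", "speaker"),
    Dep("amod", "screen", "nice"),
]
pos = {"screen": "nn", "speaker": "nn", "great": "jj", "nice": "jj"}

aspects, opinions = propagate(deps, pos, seed_opinions={"great"})
print(sorted(aspects))   # ['screen', 'speaker']
print(sorted(opinions))  # ['great', 'nice']
```

With “great” as the only seed, Type 1 finds “screen”, Type 2 then reaches “speaker”, and Type 3 adds “nice” as a new opinion word, which is the propagation behavior described above.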

4. Rule Set Selection Algorithm

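4.1 The Main Idea and Steps
As mentioned in the introduction, finding the best subset of rules is intractable, so we propose a greedy algorithm for the task. It consists of three steps:
Step 1: Rule evaluation.
Clearly, rule quality is a key criterion for rule selection. This step evaluates each rule to assess its quality. Specifically, given a set of rules, a set of seed opinion words, and training data D with the aspects labeled in its sentences, this step applies each rule to D and outputs the rule's precision and recall. We use precision and recall because we want rules with both high precision and high recall.
Step 2: Rule ranking.
This step ranks the rules of each type by precision. If two rules have the same precision, the one with the higher recall is ranked higher. We use precision first because high-precision rules are more desirable; recall can be improved by using more rules. This step therefore produces three ranked lists, one per rule type, for use in the next step.
Step 3: Rule selection.
Given the ranked rules of each type and the training data D, this step adds the rules from the ranked lists one by one, in descending order of rank, to the current output rule subset. After each rule is added, the current rule subset is applied to D and evaluated, and its F1-score is recorded. This continues until all rules in the ranked lists have been added to the output rule set and evaluated. The algorithm then prunes the rules to produce the final rule set, which gives the best result on D. F1-score is used as the performance measure in this step because we want the final rule set to produce good overall aspect extraction results. F1-score is also the final evaluation measure in our experiments.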
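4.2 Rule Evaluation
- Given training data $D$ (e.g., product reviews) with labeled aspects, evaluating each rule in $R^1$ is straightforward. We simply apply each rule $r^1_i \in R^1$ with the seed opinion words $O$ to $D$ to extract all possible aspects, denoted by $A^1_i$ ($i \in [1, N]$, where $N$ is the number of Type 1 rules).
- The set of correct aspects in $A^1_i$ is denoted by $T^1_i$. The precision of $r^1_i$ is defined by $r^1_i.pre = |T^1_i| / |A^1_i|$, and the recall of $r^1_i$ is defined by $r^1_i.rec = |T^1_i| / |All\_lab|$, where $|\cdot|$ means the size of a set and $All\_lab$ is the set of all labeled aspects in $D$.
- The set of all aspects extracted by Type 1 rules is denoted by $A^1 = \bigcup_{i=1}^{N} A^1_i$. Note that the rules in $R^1$ are independent of each other.
- Type 2 rules depend on Type 1 rules because they need some known aspects as input in order to extract new aspects. Given $A^1$, we apply each rule $r^2_j \in R^2$ to $D$ to obtain the new aspects extracted by $r^2_j$, denoted by $A^2_j$ ($j \in [1, M]$, where $M$ is the number of Type 2 rules, and $A^2_j \cap A^1 = \emptyset$). The set of correct aspects in $A^2_j$ is denoted by $T^2_j$. The precision of $r^2_j$ is defined by $r^2_j.pre = |T^2_j| / |A^2_j|$, and its recall by $r^2_j.rec = |T^2_j| / |All\_lab|$. The set of all aspects extracted by Type 2 rules is denoted by $A^2 = \bigcup_{j=1}^{M} A^2_j$.
- Type 3 rules depend on both Type 1 and Type 2 rules, because Type 3 rules do not extract aspects directly and must rely on Type 1 and Type 2 rules to produce new aspects. Given $O$, we apply each rule $r^3_k \in R^3$ together with all rules in $R^1$ and $R^2$ to $D$, and the new aspects obtained by adding $r^3_k$ to $R^1 \cup R^2$ are denoted by $A^3_k$ ($k \in [1, L]$, where $L$ is the number of Type 3 rules, and $A^3_k \cap (A^1 \cup A^2) = \emptyset$). The set of correct aspects in $A^3_k$ is denoted by $T^3_k$. The precision of $r^3_k$ is defined by $r^3_k.pre = |T^3_k| / |A^3_k|$, and its recall by $r^3_k.rec = |T^3_k| / |All\_lab|$.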
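The per-rule precision and recall defined above reduce to a few set operations. The following is a minimal sketch, assuming each rule's extractions are already available as a set of aspect terms; the function name `evaluate_rule` and the toy data are hypothetical and not from the paper.

```python
def evaluate_rule(extracted, labeled, already_extracted=frozenset()):
    """Return (precision, recall) for one rule.

    extracted         -- aspects the rule extracts on D
    labeled           -- All_lab, every labeled aspect in D
    already_extracted -- aspects credited to lower rule types (e.g. A^1 when
                         scoring a Type 2 rule); they are removed so that only
                         the rule's NEW aspects are counted
    """
    new = set(extracted) - set(already_extracted)
    correct = new & set(labeled)
    precision = len(correct) / len(new) if new else 0.0
    recall = len(correct) / len(labeled) if labeled else 0.0
    return precision, recall


all_lab = {"screen", "speaker", "battery", "price"}

# A Type 1 rule that extracts one correct and one wrong aspect.
a1 = {"screen", "idea"}
print(evaluate_rule(a1, all_lab))  # (0.5, 0.25)

# A Type 2 rule is scored only on what it adds beyond A^1.
print(evaluate_rule({"screen", "speaker"}, all_lab, already_extracted=a1))  # (1.0, 0.25)
```

Excluding `already_extracted` mirrors the conditions $A^2_j \cap A^1 = \emptyset$ and $A^3_k \cap (A^1 \cup A^2) = \emptyset$: a higher-type rule gets no credit for aspects that lower-type rules have already found.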

4.3 Rule Ranking

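After all rules have been evaluated, they are ranked by precision and recall. The three types of rules are ranked separately, i.e., the rules of each type are ranked only against each other.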
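The ranking criterion (precision first, recall as the tie-breaker) maps directly onto a two-part sort key. A minimal sketch, with a hypothetical `RuleScore` record and made-up scores:

```python
from typing import NamedTuple


class RuleScore(NamedTuple):
    rule_id: str
    precision: float
    recall: float


def rank_rules(scores):
    """Sort rules of ONE type: higher precision first, ties broken by higher recall."""
    return sorted(scores, key=lambda s: (-s.precision, -s.recall))


type1_scores = [
    RuleScore("r1", 0.80, 0.10),
    RuleScore("r2", 0.90, 0.05),
    RuleScore("r3", 0.80, 0.30),
]
print([s.rule_id for s in rank_rules(type1_scores)])  # ['r2', 'r3', 'r1']
```

The function would be called once per rule type, which yields the three separate ranked lists.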

4.4 Rule Selection

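A greedy selection algorithm based on the rule ranking is proposed here. The algorithm is called RS-DP (short for Rule Selection-DP) and is given as Algorithm 1 in the paper. RS-DP has four sub-steps. See the original paper for details.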
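Since the four sub-steps of RS-DP are not reproduced in these notes, the sketch below only illustrates the general idea from Step 3 of Section 4.1: add rules in rank order, record the F1-score on D after each addition, and keep the prefix with the best score. It is a deliberate simplification (a single ranked list, no propagation, pruning reduced to picking the best prefix), and `select_rules`, `apply_rules`, and the toy rule outputs are hypothetical names and data, not the paper's Algorithm 1.

```python
def f1_score(extracted, labeled):
    """F1 over distinct aspect terms."""
    tp = len(extracted & labeled)
    if tp == 0:
        return 0.0
    precision = tp / len(extracted)
    recall = tp / len(labeled)
    return 2 * precision * recall / (precision + recall)


def select_rules(ranked_rules, apply_rules, labeled):
    """Greedy prefix selection.

    ranked_rules -- rule ids, best first
    apply_rules  -- callable: list of rule ids -> set of aspects extracted on D
    labeled      -- set of labeled aspects in D
    """
    best_f1, best_k = 0.0, 0
    for k in range(1, len(ranked_rules) + 1):
        score = f1_score(apply_rules(ranked_rules[:k]), labeled)
        if score > best_f1:
            best_f1, best_k = score, k
    return ranked_rules[:best_k], best_f1


# Toy stand-in for running rules on the training data.
rule_output = {
    "r1": {"screen", "battery"},
    "r2": {"speaker"},
    "r3": {"idea", "thing", "way"},  # a noisy rule
}


def apply_rules(rule_ids):
    extracted = set()
    for rid in rule_ids:
        extracted |= rule_output[rid]
    return extracted


labeled = {"screen", "battery", "speaker", "price"}
selected, best = select_rules(["r1", "r2", "r3"], apply_rules, labeled)
print(selected)  # ['r1', 'r2']
```

Adding the noisy rule `r3` lowers F1 on the training data, so the selected set stops at `r2`; this is the overlap-aware effect that evaluating rules only in isolation cannot capture.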

5. Experiments

We now evaluate the proposed technique by assessing the aspect extraction performance of the selected rules.

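5.1 Datasets
- In our experiments, we use two collections of customer reviews. One is from [Hu and Liu, 2004] and contains five review datasets in four domains: digital cameras (D1, D2), cell phone (D3), MP3 player (D4), and DVD player (D5). The other was built by us to further validate the effectiveness of the method. It contains three review datasets in three domains: computer (D6), wireless router (D7), and speaker (D8).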
5.2 Evaluation Metrics
We adopt precision, recall, and F1-score as our evaluation metrics.
- There are two ways to compute the results:
(1) based on multiple occurrences of each aspect term;
(2) based on distinct occurrences of each aspect term.
- In a dataset, an important aspect often occurs many times, e.g., the aspect “picture” occurred 10 times in a set of camera reviews.
- In (1), if none of its occurrences is extracted, it is considered as 10 losses.
- In (2), if any occurrence of “picture” is extracted, it is considered as one extraction. If none is extracted, it is considered as one loss.
- (2) clearly makes sense, but (1) also makes good sense because it is crucial to get those important aspects extracted. Extracting (or missing) a more frequent aspect term is rewarded (or penalized) more heavily than extracting (or missing) a less frequent one.
- Let an extraction method return a set $A$ of distinct aspect terms, and let the set of distinct aspect terms labeled by human annotators be $T$.
- TP (true positives) is $|A \cap T|$, FP (false positives) is $|A \setminus T|$, and FN (false negatives) is $|T \setminus A|$.
- For (2), the evaluation metrics are defined as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R}$$

- For (1), F1-score is computed in the same way, but the precision and recall computations need to change because we now consider multiple occurrences of the same aspect. In the occurrence-based formulas, $f_i$ is the term frequency of $a_i$, and $E(a_i, A)$ (or $E(a_i, T)$) equals 1 if $a_i$ is an element of $A$ (or $T$), and 0 otherwise.
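Both ways of scoring can be sketched as follows. The distinct-term version uses TP, FP, and FN exactly as defined above. The occurrence-based formulas themselves are not preserved in these notes, so that version is an assumed reconstruction from the definitions of $f_i$ and $E$: precision is taken as $\sum_{a_i \in A} f_i \, E(a_i, T) / \sum_{a_i \in A} f_i$ and recall as $\sum_{a_i \in T} f_i \, E(a_i, A) / \sum_{a_i \in T} f_i$. Function names and the toy frequencies are hypothetical.

```python
def distinct_metrics(extracted, labeled):
    """Way (2): every distinct aspect term counts once."""
    tp = len(extracted & labeled)
    fp = len(extracted - labeled)
    fn = len(labeled - extracted)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def occurrence_metrics(extracted, labeled, freq):
    """Way (1), assumed form: each term is weighted by its frequency f_i."""
    hit = sum(freq[a] for a in extracted & labeled)  # sum of f_i where E(...) = 1
    p_den = sum(freq[a] for a in extracted)
    r_den = sum(freq[a] for a in labeled)
    p = hit / p_den if p_den else 0.0
    r = hit / r_den if r_den else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


extracted = {"picture", "idea"}
labeled = {"picture", "lens"}
freq = {"picture": 10, "lens": 2, "idea": 3}  # occurrences in the reviews

print(distinct_metrics(extracted, labeled))  # (0.5, 0.5, 0.5)
p, r, f1 = occurrence_metrics(extracted, labeled, freq)
print(round(p, 3), round(r, 3))
```

On this toy input, finding the frequent term “picture” and missing the rare term “lens” gives a higher occurrence-based recall (10/12) than distinct recall (1/2), which illustrates why frequent aspects are rewarded more heavily under (1).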
5.3 Compared Approaches
- In the experiments, we compare our approach with DP [Qiu et al., 2011] and CRF [Jakob and Gurevych, 2010]. The reason to compare with DP is that we use DP rules as the input of our algorithm.
- We also compare with CRF because our approach is supervised. In total, we consider the following approaches: DP, DP⁺, RS-DP, RS-DP⁺, CRF, and CRF⁺.
- DP⁺ still uses the same 8 aspect extraction patterns as DP. The difference is that DP⁺ uses more dependency relations in the patterns. DP⁺ uses 18 dependency relations: amod, prep, nsubj, csubj, xsubj, dobj, iobj, conj, advmod, dep, cop, mark, nsubjpass, pobj, acomp, xcomp, csubjpass, and poss.
- Cross-domain test.
1) In testing RS-DP, RS-DP⁺, CRF, and CRF⁺, to reflect cross-domain aspect extraction, we use leave-one-out cross validation for D1 to D5, i.e., the algorithm selects rules based on the annotated data from four products and tests the selected rules using the unseen data from the remaining product.
2) For D6 to D8, the algorithm selects rules based on the annotated data from D1 to D5 and tests the selected rules on each of the datasets D6 to D8. This simulates the situation in which the selected rules can be applied to any domain (i.e., in a domain-independent manner).
3) It is worth noting that in all six approaches we extract not only single-noun aspects but also noun phrases (multi-word expressions).
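The cross-domain protocol in 1) and 2) amounts to a fixed list of train/test splits. A minimal sketch, where the function name `cross_domain_splits` is hypothetical and the dataset ids are just labels:

```python
def cross_domain_splits(base, extra):
    """Build (training datasets, test dataset) pairs.

    base  -- datasets used for leave-one-out (D1..D5 in the post)
    extra -- datasets tested with rules selected on ALL base datasets (D6..D8)
    """
    splits = []
    # 1) Leave-one-out: select rules on four products, test on the held-out one.
    for held_out in base:
        splits.append(([d for d in base if d != held_out], held_out))
    # 2) Select rules on all base datasets, test on each additional dataset.
    for target in extra:
        splits.append((list(base), target))
    return splits


for train, test in cross_domain_splits(["D1", "D2", "D3", "D4", "D5"], ["D6", "D7", "D8"]):
    print(train, "->", test)
```

In every split the test dataset contributes no labeled data to rule selection, which is what makes the evaluation cross-domain.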
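5.4 Experimental Results
Both CRF and CRF⁺ perform much worse than RS-DP and RS-DP⁺. We believe one of the main reasons is that the rule-based approach can propagate and improve iteratively, which the CRF-based approach cannot do.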

6. Conclusion

This paper proposed an automated rule set selection/learning method with the goal of improving the syntactical rule-based approach to aspect extraction in opinion mining. Compared with the baselines, the proposed technique is much more effective.
In our future work, we plan to employ semantic rule patterns, which can be learned or designed based on semantic parsing, in addition to the syntactic rule patterns used in the DP method.
We also plan to explore other possible algorithms for rule selection, such as simulated annealing strategies and genetic algorithms.
