Introduction: 2019_IEEE_TKDE_Gan_Survey_of_Utility_Mining

Introduction

  1. Data mining [1], [2] focuses on extraction of information from a large set of data and transforms it into an easily interpretable structure for further use.

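    extraction: the act of extracting or drawing out.
    interpretable: capable of being explained or understood.

    Data mining [1], [2] focuses on extracting information from large volumes of data and converting it into an easily interpretable structure for further use.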

  2. It is an interdisciplinary field focused on scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured. Mining interesting patterns from different types of data is quite important in many real-life applications [1], [3], [4], [5], [6].

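    interdisciplinary: spanning several academic disciplines.

    It is an interdisciplinary field concerned with scientific methods, processes, and systems for extracting knowledge or insights from data in any form, whether structured or unstructured. In many practical applications, mining interesting patterns from different types of data is very important.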

  3. In recent decades, the task of interesting pattern mining [e.g., frequent pattern mining (FPM) [7], [8], association rule mining (ARM) [9], [10], frequent episode mining (FEM) [11], [12], [13], [14], and sequential pattern mining (SPM) [5], [15], [16], [17]] has been extensively studied.

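    episode: an incident or event.

    Over the past few decades, interesting pattern mining tasks such as frequent pattern mining (FPM) [7], [8], association rule mining (ARM) [9], [10], frequent episode mining (FEM) [11], [12], [13], [14], and sequential pattern mining (SPM) [5], [15], [16], [17] have all been studied extensively.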

  4. These are important and fundamental data mining techniques [1] that satisfy the requirements of real-world applications in numerous domains. Most of them aim at extracting the desired patterns using frequency or co-occurrence [7], [8], [9], [10], as well as other properties and interestingness measures [18], [19], [20], [21].

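    co-occurrence: the appearance of items together.
    property: a characteristic or attribute.

    These are important, fundamental data mining techniques that meet the needs of real applications in many domains. Most of these methods extract the desired patterns using frequency or co-occurrence [7], [8], [9], [10], together with other properties and interestingness measures [18], [19], [20], [21].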

  5. Despite the wide use of pattern mining techniques, most of these algorithms do not allow for the discovery of utility-oriented patterns, i.e., those that contribute the most to a predefined utility threshold, an objective function, or a performance metric.

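    utility-oriented: driven by utility.
    oriented: directed toward; guided by.
    performance metric: a measure of performance.

    Although pattern mining techniques are widely used, most of these algorithms cannot discover utility-oriented patterns, that is, the patterns that contribute most to a predefined utility threshold, objective function, or performance metric.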

  6. In general, some implicit factors, such as the utility, interestingness, or risk of objects/patterns, are commonly seen in real-world situations. The knowledge that is actually important to the user may not be found by traditional data mining algorithms. Therefore, a novel utility mining framework, called utility-oriented pattern mining (UPM) or high-utility pattern mining (HUPM) [22], [23], [24], which considers the relative importance of items (utility-oriented [25]), has become an emerging research topic in recent years. In UPM, the utility (i.e., importance, interest, satisfaction, or risk) of each item can be predefined based on a user’s background knowledge or preferences.

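    implicit: implied rather than stated outright.

    Generally speaking, implicit factors such as the utility, interestingness, or risk of objects/patterns are common in the real world. Traditional data mining algorithms may fail to find the knowledge that truly matters to the user. As a result, a new utility mining framework that takes the relative importance of items into account (utility-oriented [25]), namely utility-oriented pattern mining (UPM) or high-utility pattern mining (HUPM) [22], [23], [24], has become an emerging research topic in recent years. In UPM, the utility (i.e., importance, interest, satisfaction, or risk) of each item can be predefined according to the user's background knowledge or preferences.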

  7. According to Wikipedia, in economics, utility is a measure of preferences over some set of goods (including services, i.e., something that satisfies human wants); it represents satisfaction experienced by the consumer of a good. Hence, utility is a subjective measure. This definition indicates that a subjective value is associated with a specific value in a domain to express user preference. In practice, the value of utility is assigned by the user according to their interpretation of domain-specific knowledge measured by a specific value, such as cost, profit, or aesthetic value. According to the studies of Li et al. [18], interestingness measures can be classified as objective measures, subjective measures, and semantic measures [18], [20], [21]. Objective measures [21], [26], such as support or confidence for pattern mining, are based only on the data itself, whereas subjective measures [27], [28], such as unexpectedness or novelty, take into account the user’s domain knowledge. Semantic measures [24], such as utility, consider the data itself as well as the user’s expectation. Hence, utility is a quantitative representation of user preference, and the usefulness of an itemset is quantified in terms of its utility value. Utility can be defined as “A measure of how ‘useful’ (i.e., profitable) an itemset is” [24], [29].

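    aesthetic: relating to beauty or artistic taste.
    semantic: relating to meaning.
    whereas: in contrast; while.
    expectation: what one anticipates.
    quantitative: expressed as a quantity or number.
    interpretation: an explanation or reading of something.

    According to Wikipedia, utility in economics measures preference over a set of goods (including services, that is, anything that satisfies human wants); it represents the satisfaction a consumer experiences from a good. Utility is therefore a subjective measure. *This definition shows that a subjective value is associated with a specific value in the domain in order to express user preference.* In practice, the user assigns the utility value according to their interpretation of domain-specific knowledge, which is measured by a specific value such as cost, profit, or aesthetic value. According to the work of Li et al. [18], interestingness measures can be divided into objective, subjective, and semantic measures [18], [20], [21]. Objective measures [21], [26], such as support and confidence in pattern mining, rest only on the data itself, while subjective measures [27], [28], such as unexpectedness or novelty, take the user's domain knowledge into account. Semantic measures [24], such as utility, consider both the data itself and the user's expectation. Utility is thus a quantified representation of user preference, and the usefulness of an itemset is quantified by its utility value. Utility can be defined as "a measure of how 'useful' (i.e., profitable) an itemset is" [24], [29].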
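
The objective measures named in this paragraph, support and confidence, depend only on the transaction data. A minimal Python sketch, assuming a hypothetical toy database; the item names and the `support` and `confidence` helpers are illustrative and do not come from the survey:

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy data: each transaction is the set of items bought together.
transactions: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def support(itemset: Iterable[str], db: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    return sum(1 for t in db if wanted <= t) / len(db)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               db: List[FrozenSet[str]]) -> float:
    """support(antecedent and consequent together) / support(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, db) / support(antecedent, db)


print(support({"bread", "milk"}, transactions))        # 0.5
print(confidence({"bread"}, {"milk"}, transactions))   # about 0.667
```

Neither function looks at prices, profits, or any other user-supplied weight, which is why the text classifies these measures as objective.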

  8. Formally, a pattern is said to be useful to a user if it satisfies a specific utility constraint. In practice, the utility value of a pattern can be measured in terms of cost, profit, aesthetic value, or other measures of user preference.

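    Formally, a pattern is considered useful to a user if it satisfies a specific utility constraint. In practice, a pattern's utility value can be measured by cost, profit, aesthetic value, or other measures of user preference.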

  9. To address these issues, utility-oriented pattern mining (hereinafter called UPM) has become a useful task and an important topic in data mining. In UPM, each object/item has a unit utility (e.g., unit profit) and can appear more than once in each transaction or event (e.g., purchase quantity). The utility of a pattern represents its importance or satisfaction, which can be measured in terms of risk, profit, cost, quantity, or other information depending on user preference.

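    *To address these problems, utility-oriented pattern mining (hereinafter UPM) has become an important topic in data mining.* In UPM, each object/item has a unit utility (e.g., unit profit) and can appear more than once in a transaction or event (e.g., as a purchase quantity). The utility of a pattern represents its importance or the satisfaction it brings, and can be measured by risk, profit, cost, quantity, or other information that depends on user preference.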

  10. In general, the utility of a pattern is based on local transaction utility (also called internal utility) and external utility [24], [29]. The internal utility of an object/item is defined according to the information stored in a transaction/event, such as the quantity of the object/item that occurred or was sold. The external utility can be a measure for describing user preferences. Therefore, the utility of a pattern depends on the utility function specified by the user, which can be the Sum, Average, or Multiplication of quantity and profit of this pattern in databases.

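    In general, the utility of a pattern is based on local transaction utility (also called internal utility) and external utility [24], [29]. The internal utility of an object/item is defined from the information stored in the transaction/event, for example the quantity of the object/item that occurred or was sold. The external utility can be used to measure user preference. The utility of a pattern therefore depends on the utility function the user specifies, which can be the sum, average, or product of the pattern's quantity and profit in the database.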
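
One common way to combine the two kinds of utility is to multiply each item's quantity (internal utility) by its unit profit (external utility) and sum the products over the transactions that contain the whole pattern. The sketch below assumes that convention; the data, the `unit_profit` table, and the function names are hypothetical and only illustrate the idea:

```python
from typing import Dict, Iterable, List

# Hypothetical external utilities: unit profit of each item.
unit_profit: Dict[str, int] = {"a": 5, "b": 2, "c": 1}

# Hypothetical database: each transaction maps item -> purchase quantity
# (the internal utility of the item in that transaction).
database: List[Dict[str, int]] = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 6},
    {"a": 1, "b": 1, "c": 3},
    {"b": 4, "c": 2},
]


def item_utility(item: str, transaction: Dict[str, int]) -> int:
    """Utility of one item in one transaction: quantity * unit profit."""
    return transaction[item] * unit_profit[item]


def itemset_utility_in(itemset: Iterable[str], transaction: Dict[str, int]) -> int:
    """Utility of an itemset in one transaction: sum of its item utilities."""
    return sum(item_utility(i, transaction) for i in itemset)


def itemset_utility(itemset: Iterable[str], db: List[Dict[str, int]]) -> int:
    """Utility of an itemset in the database, summed over the
    transactions that contain every item of the itemset."""
    items = set(itemset)
    return sum(itemset_utility_in(items, t) for t in db if items.issubset(t))


print(item_utility("a", database[1]))         # 2 * 5 = 10
print(itemset_utility({"a", "b"}, database))  # (5 + 4) + (5 + 2) = 16
```

Swapping the outer `sum` for an average would give one of the other utility functions the paragraph mentions; the sum of products is used here only because it is the simplest to follow.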

  11. More specifically, the utility-based method for pattern mining can find various types of patterns that could not be identified using previous theories and techniques. According to previous studies, UPM has a wide range of applications, including website click-stream analysis [30], [31], [32], cross-marketing in retail stores [33], [34], mobile commerce environments [35], [36], gene regulation [37], and biomedical applications [38]. Through 15 years of study and development, many techniques and approaches have been proposed for UPM in various applications. As shown in Fig. 1, there has been a rapid surge of interest in UPM in recent years in terms of the number of academic papers published in several sub-fields, including high-utility itemsets [29], high-utility rules [39], [40], high-utility sequential patterns [41], [42], and high-utility episodes [43], [44].

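    surge: a sudden, strong rise.

    More specifically, utility-based pattern mining can find many types of patterns that earlier theories and techniques could not identify. *According to previous studies, UPM has a wide range of applications, including website click-stream analysis [30], [31], [32], cross-marketing in retail stores [33], [34], mobile commerce environments [35], [36], gene regulation [37], and biomedical applications [38]. After 15 years of research and development, many techniques and approaches have been proposed for a variety of applications.* As shown in Fig. 1, the number of academic papers on UPM has grown rapidly in recent years across several sub-fields, including high-utility itemsets [29], high-utility rules [39], [40], high-utility sequential patterns [41], [42], and high-utility episodes [43], [44].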
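
To see how a utility-based view can surface patterns that a frequency-based view misses, the sketch below compares the two on a small hypothetical database. It enumerates every itemset by brute force, which is exponential in the number of items and is used here only for illustration; it is not one of the algorithms the survey reviews. The data, both thresholds, and all function names are assumptions:

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Iterator, List, Set

# Hypothetical unit profits (external utility) and quantities (internal utility).
unit_profit: Dict[str, int] = {"pen": 1, "ink": 2, "laptop": 300}
database: List[Dict[str, int]] = [
    {"pen": 2, "ink": 1},
    {"pen": 1, "ink": 2},
    {"pen": 3},
    {"laptop": 1, "ink": 1},
]


def support_count(itemset: FrozenSet[str], db: List[Dict[str, int]]) -> int:
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in db if itemset.issubset(t))


def utility(itemset: FrozenSet[str], db: List[Dict[str, int]]) -> int:
    """Sum of quantity * unit profit over the transactions containing the itemset."""
    return sum(
        sum(t[i] * unit_profit[i] for i in itemset)
        for t in db
        if itemset.issubset(t)
    )


def all_itemsets(items: Iterable[str]) -> Iterator[FrozenSet[str]]:
    """Yield every non-empty subset of `items` (brute force)."""
    ordered = sorted(items)
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)


MIN_SUPPORT = 2    # assumed frequency threshold (transaction count)
MIN_UTILITY = 100  # assumed utility threshold

frequent: Set[FrozenSet[str]] = {
    x for x in all_itemsets(unit_profit) if support_count(x, database) >= MIN_SUPPORT
}
high_utility: Set[FrozenSet[str]] = {
    x for x in all_itemsets(unit_profit) if utility(x, database) >= MIN_UTILITY
}

print(sorted(sorted(x) for x in frequent))      # itemsets seen at least twice
print(sorted(sorted(x) for x in high_utility))  # itemsets worth at least 100
```

With this data the two result sets do not overlap: the itemsets containing `laptop` appear in a single transaction, so a frequency threshold of 2 discards them, yet they account for almost all of the profit.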

  12. Although there are a considerable number of published studies and surveys on data mining, especially on pattern mining, none of them discusses UPM. Yet, after more than 15 years of theoretical development, a significant number of new technologies and applications have appeared in the UPM field. Unfortunately, there is no comprehensive survey of utility-oriented pattern mining methods and no study that systematically compares the state-of-the-art algorithms. We believe that now is a good time to summarize the new technologies and address the gap between theory and application. Here, we attempt to find a clearer way to present the concepts and practical aspects of UPM for the data mining research community. In this paper, we provide a systematic and comprehensive survey of the significant advances in UPM. The methods discussed in this article are not only important for high-utility pattern (i.e., itemset [24], [29], rule [39], [40], sequence, episode, etc.) mining but can also serve as inspiration for other data mining tasks [1], [2], including episode mining [11], [12], [13], [14], distributed data mining [45], and incremental/dynamic data mining [46], [47]. The major contributions are listed as follows:

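    incremental: growing by successive additions.
    dynamic: changing over time.

    Although a large number of studies and surveys on data mining, and on pattern mining in particular, have been published, none of them discusses UPM. Yet after more than 15 years of theoretical development, many new techniques and applications have appeared in the UPM field. Unfortunately, there is still no comprehensive survey of utility-oriented pattern mining methods, and no study that systematically compares the most advanced algorithms. We believe now is a good time to summarize the new techniques and to close the gap between theory and application. Here we try to find a clearer way to present the concepts and practical aspects of UPM to the data mining research community. In this paper we give a systematic and comprehensive survey of the important advances in UPM. The methods discussed here matter not only for mining high-utility patterns (i.e., itemset [24], [29], rule [39], [40], sequence, episode, etc.) but can also inspire other data mining tasks [1], [2], including episode mining [11], [12], [13], [14], distributed data mining [45], and incremental/dynamic data mining [46], [47]. The main contributions are as follows: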

  13. (1) This paper first presents the background, motivation, and a comprehensive survey of UPM (Section 1). This survey investigates more than 150 UPM papers published in the last 15 years and summarizes them in a systematic fashion.
    (2) This survey first introduces an in-depth understanding of UPM, including concepts, examples, comparisons with related studies (e.g., FPM, SPM), applications, and evaluation measures (Section 2). This survey presents a bird’s-eye view, and then deeply and comprehensively summarizes the developments of UPM, comparing the state-of-the-art works to earlier works (Section 3).
    (3) A taxonomy of the most common and the state-of-the-art approaches for UPM is presented, including Apriori-based, tree-based, projection-based, vertical/horizontal-data-format-based, and other hybrid approaches (Section 3). We further analyze the pros and cons of each presented approach.
    (4) A comprehensive review of advanced topics of utility mining techniques (e.g., dynamic UPM, concise representation of utility patterns, HUSPM, HUEM, UPM in big data, and privacy preserving for UPM) is presented (Section 4), with a discussion of their pros and cons. Not only the representative algorithms but also the advances and latest progress are reviewed.
    (5) We further review some well-known open-source software and datasets of UPM (Section 5) and hope that these resources may reduce barriers for future research. Finally, we identify several important issues and research opportunities for UPM (Section 6).

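    taxonomy: a scheme of classification.
    pros and cons: advantages and disadvantages.
    concise: brief and to the point.

    (1) The paper first presents the background, the motivation, and an overall survey of UPM (Section 1). The survey examines more than 150 UPM papers published over the past 15 years and summarizes them systematically.
    (2) The survey first builds an in-depth understanding of UPM, covering concepts, examples, comparisons with related work (e.g., FPM, SPM), applications, and evaluation measures (Section 2). It then takes a bird's-eye view and gives a deep, comprehensive summary of how UPM has developed, comparing the latest work with earlier work (Section 3).
    (3) The paper presents a taxonomy of the most common and most advanced UPM approaches, including Apriori-based, tree-based, projection-based, vertical/horizontal-data-format-based, and other hybrid approaches (Section 3). The pros and cons of each approach are analyzed further.
    (4) The paper gives a comprehensive review of advanced topics in utility mining (e.g., dynamic UPM, concise representation of utility patterns, HUSPM, HUEM, UPM in big data, and privacy preservation for UPM) (Section 4) and discusses their pros and cons. It reviews not only the representative algorithms but also the advances and the latest progress.
    (5) The paper further reviews some well-known open-source software and datasets for UPM (Section 5), in the hope that these resources lower the barriers to future research. Finally, it identifies several important issues and research opportunities for UPM (Section 6).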

  14. The remainder of this article is organized as follows. In Section 2, we introduce the necessary background information, the basic concepts and examples, and the applications in this field. In Section 3, we give a high-level overview of emerging UPM problems and survey several popular methods, as well as recent developments. In Section 4, we discuss advanced topics and techniques of UPM. In addition, several well-known open-source software packages and datasets are summarized in Section 5. We describe some open challenges and opportunities in Section 6. Several future directions are described in Section 7.

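    give a high-level overview of emerging UPM problems: to outline, at a high level, the UPM problems that are now emerging.

    The rest of the paper is organized as follows. Section 2 introduces the necessary background, the basic concepts and examples, and the applications in this field. Section 3 gives a high-level overview of emerging UPM problems and surveys several popular methods along with recent developments. Section 4 discusses advanced topics and techniques in UPM. Section 5 then summarizes some well-known open-source software and datasets. Section 6 describes some open challenges and opportunities. Section 7 describes several future directions.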
