Learning under Concept Drift: A Review

Lilyan_blog

Tags: machine learning, data mining

Article link: https://blog.csdn.net/lytwy123/article/details/111303139

Abstract

Concept drift describes unforeseeable changes in the underlying distribution of streaming data over time. Concept drift research involves the development of methodologies and techniques for drift detection, understanding and adaptation. Data analysis has revealed that machine learning in a concept drift environment will result in poor learning results if the drift is not addressed. To help researchers identify which research topics are significant and how to apply related techniques in data analysis tasks, it is necessary that a high quality, instructive review of current research developments and trends in the concept drift field is conducted. In addition, due to the rapid development of concept drift in recent years, the methodologies of learning under concept drift have become noticeably systematic, unveiling a framework which has not been mentioned in literature. This paper reviews over 130 high quality publications in concept drift related research areas, analyzes up-to-date developments in methodologies and techniques, and establishes a framework of learning under concept drift including three main components: concept drift detection, concept drift understanding, and concept drift adaptation. This paper lists and discusses 10 popular synthetic datasets and 14 publicly available benchmark datasets used for evaluating the performance of learning algorithms aiming at handling concept drift. Also, concept drift related research directions are covered and discussed. By providing state-of-the-art knowledge, this survey will directly support researchers in their understanding of research developments in the field of learning under concept drift.

Introduction

Governments and companies are generating huge amounts of streaming data and urgently need efficient data analytics and machine learning techniques to support them in making predictions and decisions. However, the rapidly changing environment of new products, new markets and new customer behaviors inevitably results in the appearance of the concept drift problem. Concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways [1]. If concept drift occurs, the pattern induced from past data may not be relevant to the new data, leading to poor predictions and decision outcomes. The phenomenon of concept drift has been recognized as the root cause of decreased effectiveness in many data-driven information systems such as data-driven early warning systems and data-driven decision support systems. In an ever-changing and big data environment, how to provide more reliable data-driven predictions and decision facilities has become a crucial issue.

The concept drift problem exists in many real-world situations. An example can be seen in the changes of behavior in mobile phone usage, as shown in Fig. 1. From the bars in this figure, the time percentage distribution of the mobile phone usage pattern has changed from "Audio Call" to "Camera" and then to "Mobile Internet" over the past two decades.

Recent research in the field of concept drift targets more challenging problems, i.e., how to accurately detect concept drift in unstructured and noisy datasets [2], [3], how to quantitatively understand concept drift in an explainable way [4], [5], and how to effectively react to drift by adapting related knowledge [6], [7].

Solving these challenges endows prediction and decision-making with adaptability in an uncertain environment. Conventional research related to machine learning has been significantly improved by introducing concept drift techniques in data science and artificial intelligence in general, and in pattern recognition and data stream mining in particular. These new studies enhance the effectiveness of analogical and knowledge reasoning in an ever-changing environment. A new topic is formed during this development: adaptive data-driven prediction/decision systems. In particular, concept drift is a highly prominent and significant issue in the context of the big data era because the uncertainty of data types and data distribution is an inherent nature of big data.

Conventional machine learning has two main components: training/learning and prediction. Research on learning under concept drift presents three new components: drift detection (whether or not drift occurs), drift understanding (when, how, where it occurs) and drift adaptation (reaction to the existence of drift), as shown in Fig. 2. These will be discussed in Sections 3-5.
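
To make the three components concrete, here is a minimal Python sketch of a learner that predicts, detects drift, records when it happened, and adapts. It is a toy under stated assumptions, not the framework of the survey: the constant "majority label" model, the sliding error window, the `threshold` value and the function names `majority` and `learn_under_drift` are all hypothetical choices made for illustration.

```python
from collections import deque


def majority(labels):
    """Return the most frequent binary label (ties go to 1)."""
    ones = sum(labels)
    return 1 if 2 * ones >= len(labels) else 0


def learn_under_drift(stream, window=20, threshold=0.5):
    """Toy loop: predict, detect drift, note when it occurred, adapt.

    Assumption: drift is flagged when the error rate over the last
    `window` predictions exceeds `threshold`.
    """
    recent = deque(maxlen=window)   # most recent true labels
    errors = deque(maxlen=window)   # most recent 0/1 prediction errors
    model = 0                       # constant classifier: always predicts this label
    drift_times = []

    for t, y in enumerate(stream):
        errors.append(int(model != y))   # prediction, then feedback
        recent.append(y)

        # Drift detection: has the recent error rate become too high?
        if len(errors) == window and sum(errors) / window > threshold:
            drift_times.append(t)        # drift understanding: "when"
            model = majority(recent)     # drift adaptation: retrain on recent data
            errors.clear()               # start monitoring the new model afresh

    return drift_times, model


# A stream whose concept changes abruptly at index 100.
stream = [0] * 100 + [1] * 100
drift_times, model = learn_under_drift(stream)
print(drift_times, model)
```

On this stream the sketch flags a single drift shortly after index 100 and switches its prediction to the new label. Real methods replace each step with something more principled, for example a hypothesis test in place of the fixed threshold.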

In the literature, a detailed concept drift survey paper [8] was published in 2014 but intentionally left certain sub-problems of concept drift to other publications, such as the details of the data distribution change ($P(X)$), as mentioned in their Section 2.1. In 2015, another comprehensive survey paper [9] was published, which surveys and gives a tutorial on both the established and the state-of-the-art approaches. It provides a hybrid view of concept drift from two primary perspectives, active and passive. Both survey papers are comprehensive and can be a good introduction to concept drift research. However, many new publications have become available in the last three years, and even a new category of drift detection methods has arisen, named multiple hypothesis test drift detection. It is necessary to review the past research focuses and give the most recent research trends about concept drift, which is one of the main contributions of this survey paper.

Besides these two publications, four related survey papers [6], [7], [10], [11] have also provided valuable insights into how to address concept drift, but their specific research focus is only on data stream learning, rather than analyzing concept drift adaptation algorithms and understanding concept drift. Specifically, paper [7] focuses on data reduction for stream learning incorporating concept drift, while [6] only focuses on investigating the development in learning ensembles for data stream learning in a dynamic environment. [11] concerns the evolution of data stream clustering, and [10] focuses on investigating the current and future trends of data stream learning. There is therefore a gap in the current literature that requires a fuller picture of established and newly emerged research on concept drift; a comprehensive review of the three major aspects of concept drift: concept drift detection, understanding and adaptation, as shown in Fig. 2; and a discussion about the new trend of concept drift research.

The main contributions of this paper are:

1) It perceptively summarizes concept drift research achievements and clusters the research into three categories: concept drift detection, understanding and adaptation, providing a clear framework for concept drift research development (Fig. 2);

2) It proposes a new component, concept drift understanding, for retrieving information about the status of concept drift in aspects of when, how, and where. This also creates a connection between drift detection and drift adaptation;

3) It uncovers several very new concept drift techniques, such as active learning under concept drift and fuzzy competence model-based drift detection, and identifies related research involving concept drift;

4) It systematically examines two sets of concept drift datasets, synthetic datasets and real-world datasets, through multiple dimensions: dataset description, availability, suitability for type of drift, and existing applications;

5) It suggests several emerging research topics and potential research directions in this area.

The remainder of this paper is structured as follows. In Section 2, the definitions of concept drift are given and discussed. Section 3 presents research methods and algorithms in concept drift detection. Section 4 discusses research developments in concept drift understanding. Research results on drift adaptation (concept drift reaction) are reported in Section 5. Section 6 presents evaluation systems and related datasets used to test concept drift algorithms. Section 7 summarizes related research concerning the concept drift problem. Section 8 presents a comprehensive analysis of main findings and future research directions.

Problem Description

Concept drift definition and the sources

Concept drift is a phenomenon in which the statistical properties of a target domain change over time in an arbitrary way [3]. It was first proposed by [12], who aimed to point out that noise data may turn into non-noise information at different times. These changes might be caused by changes in hidden variables which cannot be measured directly [4]. Formally, concept drift is defined as follows:

Given a time period $[0, t]$, a set of samples, denoted as $S_{0,t} = \{d_0, \ldots, d_t\}$, where $d_i = (X_i, y_i)$ is one observation (or a data instance), $X_i$ is the feature vector, $y_i$ is the label, and $S_{0,t}$ follows a certain distribution $F_{0,t}(X, y)$. Concept drift occurs at timestamp $t+1$ if $F_{0,t}(X, y) \neq F_{t+1,\infty}(X, y)$, denoted as $\exists t: P_t(X, y) \neq P_{t+1}(X, y)$ [2], [8], [13], [14].
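
The definition compares two joint distributions over (X, y). As a rough illustration, the Python sketch below estimates each joint distribution from a window of discrete observations and measures how far apart they are. The total variation distance, the helper names `empirical_joint` and `total_variation`, and the toy windows are assumptions chosen for this example; they are not taken from the survey.

```python
from collections import Counter


def empirical_joint(samples):
    """Estimate P(X, y) from a list of (x, y) pairs as relative frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return {pair: c / n for pair, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Window observed up to time t: feature "b" is always labelled 1.
window_old = [("a", 0)] * 50 + [("b", 1)] * 50
# Window observed after time t: feature "b" is now labelled 0.
window_new = [("a", 0)] * 50 + [("b", 0)] * 50

p_old = empirical_joint(window_old)
p_new = empirical_joint(window_new)

same = total_variation(p_old, p_old)    # no change between identical windows
drift = total_variation(p_old, p_new)   # positive: the joint distribution changed
print(same, drift)
```

On real streams two finite windows almost never match exactly even without drift, so a nonzero distance alone is not evidence of drift; deciding whether the difference is significant is the job of the hypothesis test stage of drift detection.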

Concept drift has also been defined by various authors using alternative names, such as dataset shift [15] or concept shift [1]. Other related terminologies were introduced in [16], whose authors proposed that concept drift or shift is only one subcategory of dataset shift, and that dataset shift consists of covariate shift, prior probability shift and concept shift. These definitions clearly state the research scope of each research topic. However, concept drift is usually associated with covariate shift and prior probability shift, and an increasing number of publications [2], [8], [13], [14] use the term "concept drift" for the problem in which $\exists t: P_t(X, y) \neq P_{t+1}(X, y)$. We therefore apply the same definition of concept drift in this survey. Accordingly, concept drift at time $t$ can be defined as the change of the joint probability of $X$ and $y$ at time $t$. Since the joint probability $P_t(X, y)$ can be decomposed into two parts as $P_t(X, y) = P_t(X) \times P_t(y|X)$, concept drift can be triggered by three sources.

• Source I: $P_t(X) \neq P_{t+1}(X)$ while $P_t(y|X) = P_{t+1}(y|X)$, that is, the research focus is the drift in $P_t(X)$ while $P_t(y|X)$ remains unchanged. Since $P_t(X)$ drift does not affect the decision boundary, it has also been considered as virtual drift [7], Fig. 3(a).

• Source II: $P_t(y|X) \neq P_{t+1}(y|X)$ while $P_t(X) = P_{t+1}(X)$, that is, $P_t(X)$ remains unchanged. This drift will cause the decision boundary to change and lead to decreasing learning accuracy; it is also called actual drift, Fig. 3(b).

• Source III: a mixture of Source I and Source II, namely $P_t(X) \neq P_{t+1}(X)$ and $P_t(y|X) \neq P_{t+1}(y|X)$. Concept drift focuses on the drift of both $P_t(y|X)$ and $P_t(X)$, since both changes convey important information about the learning environment, Fig. 3(c).

Fig. 3 demonstrates how these sources differ from each other in a two-dimensional feature space. Source I is feature space drift, and Source II is decision boundary drift. In many real-world applications, Source I and Source II occur together, which creates Source III.
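
The difference between the three sources can be seen in a small simulation. The sketch below is an illustrative assumption, not an experiment from the survey: it uses a one-dimensional Gaussian feature, deterministic labels given by a threshold (standing in for P(y|X)), and a fixed classifier that learned the old threshold at 0. The names `sample` and `accuracy` are hypothetical helpers.

```python
import numpy as np


def sample(rng, n, x_mean, boundary):
    """Draw n points with X ~ N(x_mean, 1) and label y = 1 if X > boundary else 0."""
    x = rng.normal(loc=x_mean, scale=1.0, size=n)
    y = (x > boundary).astype(int)
    return x, y


def accuracy(x, y, learned_boundary=0.0):
    """Accuracy of a fixed classifier that still uses the old decision boundary."""
    predictions = (x > learned_boundary).astype(int)
    return float(np.mean(predictions == y))


rng = np.random.default_rng(0)
n = 10_000

# Source I: P(X) shifts (mean 0 -> 1), P(y|X) unchanged (boundary stays at 0).
x_s1, y_s1 = sample(rng, n, x_mean=1.0, boundary=0.0)
# Source II: P(X) unchanged, P(y|X) shifts (boundary 0 -> 1).
x_s2, y_s2 = sample(rng, n, x_mean=0.0, boundary=1.0)
# Source III: both P(X) and P(y|X) shift.
x_s3, y_s3 = sample(rng, n, x_mean=1.0, boundary=1.0)

acc_s1 = accuracy(x_s1, y_s1)   # old boundary is still correct
acc_s2 = accuracy(x_s2, y_s2)   # points between 0 and 1 are now misclassified
acc_s3 = accuracy(x_s3, y_s3)   # also degraded, and the inputs moved as well
print(acc_s1, acc_s2, acc_s3)
```

In this setup the old classifier stays perfectly accurate under Source I, which is why that case is called virtual drift, while under Source II and Source III it misclassifies every point that falls between the old and new boundaries.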

Concept Drift Detection

A general framework for drift detection

Drift detection refers to the techniques and mechanisms that characterize and quantify concept drift via identifying change points or change time intervals [17]. A general framework for drift detection contains four stages, as shown in Fig. 5.

Stage 1 (Data Retrieval) aims to retrieve data chunks from data streams. Since a single data instance cannot carry enough information to infer the overall distribution [2], knowing how to organize data chunks to form a meaningful pattern or knowledge is important in data stream analysis tasks [7].

Stage 2 (Data Modeling) aims to abstract the retrieved data and extract the key features containing sensitive information, that is, the features of the data that most impact a system if they drift. This stage is optional, because it mainly concerns dimensionality reduction, or sample size reduction, to meet storage and online speed requirements [4].

Stage 3 (Test Statistics Calculation) is the measurement of dissimilarity, or distance estimation. It quantifies the severity of the drift and forms test statistics for the hypothesis test. It is considered to be the most challenging aspect of concept drift detection. The problem of how to define an accurate and robust dissimilarity measurement is still an open question. A dissimilarity measurement can also be used in clustering evaluation [11], and to determine the dissimilarity between sample sets [18].

Stage 4 (Hypothesis Test) uses a specific hypothesis test to evaluate the statistical significance of the change observed in Stage 3, or the p-value. They are used to deter