Learning under Concept Drift: A Review


Concept drift describes unforeseeable changes in the underlying distribution of streaming data over time. Concept drift research involves the development of methodologies and techniques for drift detection, understanding and adaptation. Data analysis has revealed that machine learning in a concept drift environment will result in poor learning results if the drift is not addressed. To help researchers identify which research topics are significant and how to apply related techniques in data analysis tasks, it is necessary that a high quality, instructive review of current research developments and trends in the concept drift field is conducted. In addition, due to the rapid development of concept drift in recent years, the methodologies of learning under concept drift have become noticeably systematic, unveiling a framework which has not been mentioned in literature. This paper reviews over 130 high quality publications in concept drift related research areas, analyzes up-to-date developments in methodologies and techniques, and establishes a framework of learning under concept drift including three main components: concept drift detection, concept drift understanding, and concept drift adaptation. This paper lists and discusses 10 popular synthetic datasets and 14 publicly available benchmark datasets used for evaluating the performance of learning algorithms aiming at handling concept drift. Also, concept drift related research directions are covered and discussed. By providing state-of-the-art knowledge, this survey will directly support researchers in their understanding of research developments in the field of learning under concept drift.



Governments and companies are generating huge amounts of streaming data and urgently need efficient data analytics and machine learning techniques to support them in making predictions and decisions. However, the rapidly changing environment of new products, new markets and new customer behaviors inevitably results in the appearance of the concept drift problem. Concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways [1]. If concept drift occurs, the induced pattern of past data may not be relevant to the new data, leading to poor predictions and decision outcomes. The phenomenon of concept drift has been recognized as the root cause of decreased effectiveness in many data-driven information systems such as data-driven early warning systems and data-driven decision support systems. In an ever-changing and big data environment, how to provide more reliable data-driven predictions and decision facilities has become a crucial issue.



The concept drift problem exists in many real-world situations. An example can be seen in the changes of behavior in mobile phone usage, as shown in Fig. 1. From the bars in this figure, the time percentage distribution of the mobile phone usage pattern has changed from "Audio Call" to "Camera" and then to "Mobile Internet" over the past two decades.


Recent research in the field of concept drift targets more challenging problems, i.e., how to accurately detect concept drift in unstructured and noisy datasets [2], [3], how to quantitatively understand concept drift in an explainable way [4], [5], and how to effectively react to drift by adapting related knowledge [6], [7].


Solving these challenges endows prediction and decision-making with adaptability in an uncertain environment. Conventional research related to machine learning has been significantly improved by introducing concept drift techniques in data science and artificial intelligence in general, and in pattern recognition and data stream mining in particular. These new studies enhance the effectiveness of analogical and knowledge reasoning in an ever-changing environment. A new topic has formed during this development: adaptive data-driven prediction/decision systems. In particular, concept drift is a highly prominent and significant issue in the context of the big data era because uncertainty about data types and data distribution is an inherent property of big data.


Conventional machine learning has two main components: training/learning and prediction. Research on learning under concept drift presents three new components: drift detection (whether or not drift occurs), drift understanding (when, how, and where it occurs) and drift adaptation (reaction to the existence of drift), as shown in Fig. 2. These will be discussed in Sections 3-5.


In the literature, a detailed concept drift survey paper [8] was published in 2014 but intentionally left certain sub-problems of concept drift to other publications, such as the details of the data distribution change ($P(X)$), as mentioned in their Section 2.1. In 2015, another comprehensive survey paper [9] was published, which surveys and gives a tutorial on both the established and the state-of-the-art approaches. It provides a hybrid view of concept drift from two primary perspectives, active and passive. Both survey papers are comprehensive and can be a good introduction to concept drift research. However, many new publications have become available in the last three years, and even a new category of drift detection methods has arisen, named multiple hypothesis test drift detection. It is necessary to review the past research focuses and give the most recent research trends about concept drift, which is one of the main contributions of this survey paper.



Besides these two publications, four related survey papers [6], [7], [10], [11] have also provided valuable insights into how to address concept drift, but their specific research focus is only on data stream learning, rather than analyzing concept drift adaptation algorithms and understanding concept drift. Specifically, paper [7] focuses on data reduction for stream learning incorporating concept drift, while [6] only focuses on investigating the development of learning ensembles for data stream learning in a dynamic environment. [11] concerns the evolution of data stream clustering, and [10] focuses on investigating the current and future trends of data stream learning. There is therefore a gap in the current literature that requires a fuller picture of established and newly emerged research on concept drift; a comprehensive review of the three major aspects of concept drift: concept drift detection, understanding and adaptation, as shown in Fig. 2; and a discussion about the new trend of concept drift research.


1) It perceptively summarizes concept drift research achievements and clusters the research into three categories: concept drift detection, understanding and adaptation, providing a clear framework for concept drift research development (Fig. 2);

2) It proposes a new component, concept drift understanding, for retrieving information about the status of concept drift in aspects of when, how, and where. This also creates a connection between drift detection and drift adaptation;

3) It uncovers several very new concept drift techniques, such as active learning under concept drift and fuzzy competence model-based drift detection, and identifies related research involving concept drift;

4) It systematically examines two sets of concept drift datasets, synthetic datasets and real-world datasets, through multiple dimensions: dataset description, availability, suitability for type of drift, and existing applications;

5) It suggests several emerging research topics and potential research directions in this area.

The remainder of this paper is structured as follows. In Section 2, the definitions of concept drift are given and discussed. Section 3 presents research methods and algorithms in concept drift detection. Section 4 discusses research developments in concept drift understanding. Research results on drift adaptation (concept drift reaction) are reported in Section 5. Section 6 presents evaluation systems and related datasets used to test concept drift algorithms. Section 7 summarizes related research concerning the concept drift problem. Section 8 presents a comprehensive analysis of main findings and future research directions.



Problem Description

Concept drift definition and the sources

Concept drift is a phenomenon in which the statistical properties of a target domain change over time in an arbitrary way [3]. It was first proposed by [12], who aimed to point out that noise data may turn into non-noise information at different times. These changes might be caused by changes in hidden variables which cannot be measured directly [4]. Formally, concept drift is defined as follows:

Given a time period $[0, t]$, a set of samples, denoted as $S_{0,t} = \{d_0, \ldots, d_t\}$, where $d_i = (X_i, y_i)$ is one observation (or a data instance), $X_i$ is the feature vector, $y_i$ is the label, and $S_{0,t}$ follows a certain distribution $F_{0,t}(X, y)$. Concept drift occurs at timestamp $t+1$ if $F_{0,t}(X, y) \neq F_{t+1,\infty}(X, y)$, denoted as $\exists t: P_t(X, y) \neq P_{t+1}(X, y)$ [2], [8], [13], [14].
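To make the definition concrete, the following minimal sketch compares empirical joint distributions of $(X, y)$ before and after a simulated drift point. It is an illustration only, not a method from the surveyed literature: the binary toy stream, the `flip` parameter, and the use of total variation distance are all assumptions chosen for simplicity.

```python
import numpy as np

def sample_stream(n, flip, rng):
    """Draw n observations (X, y) with binary X, where y differs from X with probability `flip`."""
    X = rng.integers(0, 2, size=n)
    noise = rng.random(n) < flip
    y = np.where(noise, 1 - X, X)
    return X, y

def empirical_joint(X, y):
    """Estimate the 2x2 joint distribution P(X, y) from samples."""
    table = np.zeros((2, 2))
    np.add.at(table, (X, y), 1)  # count each (X, y) pair
    return table / table.sum()

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(0)
X_old, y_old = sample_stream(2000, flip=0.1, rng=rng)  # plays the role of S_{0,t}
X_new, y_new = sample_stream(2000, flip=0.9, rng=rng)  # data arriving from t+1 onward

# Two halves of the pre-drift data: same underlying distribution.
same = total_variation(
    empirical_joint(X_old[:1000], y_old[:1000]),
    empirical_joint(X_old[1000:], y_old[1000:]),
)
# Pre-drift versus post-drift data: different joint distribution.
drift = total_variation(
    empirical_joint(X_old, y_old),
    empirical_joint(X_new, y_new),
)
print(f"distance within old concept: {same:.3f}")
print(f"distance across drift point: {drift:.3f}")
```

The distance within the old concept reflects only sampling noise, while the distance across the drift point is large because $P_t(X, y) \neq P_{t+1}(X, y)$ in this toy stream.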


Concept drift has also been defined by various authors using alternative names, such as dataset shift [15] or concept shift [1]. Other related terminologies were introduced in [16], whose authors proposed that concept drift or shift is only one subcategory of dataset shift, and that dataset shift consists of covariate shift, prior probability shift and concept shift. These definitions clearly state the research scope of each research topic. However, concept drift is usually associated with covariate shift and prior probability shift, and an increasing number of publications [2], [8], [13], [14] use the term "concept drift" for the problem in which $\exists t: P_t(X, y) \neq P_{t+1}(X, y)$. Therefore, we apply the same definition of concept drift in this survey. Accordingly, concept drift at time $t$ can be defined as the change in the joint probability of $X$ and $y$ at time $t$. Since the joint probability $P_t(X, y)$ can be decomposed into two parts as $P_t(X, y) = P_t(X) \times P_t(y|X)$, concept drift can be triggered by three sources.



• Source I: $P_t(X) \neq P_{t+1}(X)$ while $P_t(y|X) = P_{t+1}(y|X)$, that is, the research focus is the drift in $P_t(X)$ while $P_t(y|X)$ remains unchanged. Since $P_t(X)$ drift does not affect the decision boundary, it has also been considered as virtual drift [7], Fig. 3(a).
• Source II: $P_t(y|X) \neq P_{t+1}(y|X)$ while $P_t(X) = P_{t+1}(X)$, that is, $P_t(X)$ remains unchanged. This drift will cause the decision boundary to change and lead to a decrease in learning accuracy; it is also called actual drift, Fig. 3(b).
• Source III: a mixture of Source I and Source II, namely $P_t(X) \neq P_{t+1}(X)$ and $P_t(y|X) \neq P_{t+1}(y|X)$. Concept drift research focuses on the drift of both $P_t(y|X)$ and $P_t(X)$, since both changes convey important information about the learning environment, Fig. 3(c).

Fig. 3 demonstrates how these sources differ from each other in a two-dimensional feature space. Source I is feature space drift, and Source II is decision boundary drift. In many real-world applications, Source I and Source II occur together, which creates Source III.
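The practical difference between the three sources can be sketched with a one-dimensional toy problem. The setup below is an assumption made purely for illustration: $X$ is Gaussian, the label is a deterministic threshold on $X$ (so $P(y|X)$ is fully described by the threshold), and the "model" is a fixed threshold learned before the drift. The function and scenario names are hypothetical.

```python
import numpy as np

def make_batch(n, mean, boundary, rng):
    """Sample X ~ N(mean, 1) and label y = 1 when X exceeds the true boundary."""
    X = rng.normal(mean, 1.0, size=n)
    y = (X > boundary).astype(int)
    return X, y

def accuracy(X, y, learned_boundary):
    """Accuracy of a fixed threshold model that was learned before the drift."""
    return float(np.mean((X > learned_boundary).astype(int) == y))

rng = np.random.default_rng(42)
learned_boundary = 0.0  # assumed to match the pre-drift concept exactly

# mean controls P(X); boundary controls P(y|X).
scenarios = {
    "no drift":   dict(mean=0.0, boundary=0.0),
    "Source I":   dict(mean=1.5, boundary=0.0),  # only P(X) changes
    "Source II":  dict(mean=0.0, boundary=1.0),  # only P(y|X) changes
    "Source III": dict(mean=1.5, boundary=1.0),  # both change
}

results = {
    name: accuracy(*make_batch(5000, rng=rng, **params), learned_boundary)
    for name, params in scenarios.items()
}
for name, acc in results.items():
    print(f"{name:<10} accuracy of pre-drift model: {acc:.3f}")
```

In this sketch the pre-drift model keeps its accuracy under Source I, because the decision boundary is untouched, and loses accuracy under Sources II and III, where the boundary has moved.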


Concept Drift Detection

A general framework for drift detection

Drift detection refers to the techniques and mechanisms that characterize and quantify concept drift via identifying change points or change time intervals [17]. A general framework for drift detection contains four stages, as shown in Fig. 5.

Stage 1 (Data Retrieval) aims to retrieve data chunks from data streams. Since a single data instance cannot carry enough information to infer the overall distribution [2], knowing how to organize data chunks to form a meaningful pattern or knowledge is important in data stream analysis tasks [7].

Stage 2 (Data Modeling) aims to abstract the retrieved data and extract the key features containing sensitive information, that is, the features of the data that most impact a system if they drift. This stage is optional, because it mainly concerns dimensionality reduction, or sample size reduction, to meet storage and online speed requirements [4].

Stage 3 (Test Statistics Calculation) is the measurement of dissimilarity, or distance estimation. It quantifies the severity of the drift and forms test statistics for the hypothesis test. It is considered to be the most challenging aspect of concept drift detection. The problem of how to define an accurate and robust dissimilarity measurement is still an open question. A dissimilarity measurement can also be used in clustering evaluation [11], and to determine the dissimilarity between sample sets [18].

Stage 4 (Hypothesis Test) uses a specific hypothesis test to evaluate the statistical significance of the change observed in Stage 3, or the p-value. They are used to deter

