[Paper Notes (1)] A Survey on Concept Drift Adaptation

Ⅰ Paper Information

This paper, A Survey on Concept Drift Adaptation, was published in ACM Computing Surveys in 2014. With more than a thousand citations, it is a landmark work in the field of concept drift.

Ⅱ Paper Structure

In Section 2 we introduce the problem of concept drift, characterize adaptive learning algorithms and present motivating application examples. Section 3 presents a comprehensive taxonomy of methods for adaptive learning. Section 4 discusses the experimental settings and evaluation methodologies of adaptive learning algorithms. Section 5 concludes the survey.

1. INTRODUCTION

  • For high-volume streaming data, only online processing is feasible
    • In this setting, predictive models can be kept current in two ways
      • trained incrementally by continuous updates
      • retrained using recent batches of data
  • This survey focuses on the concept drift problem in online supervised learning

2. ADAPTIVE LEARNING ALGORITHMS

  • In a dynamically changing environment, a crucial property of an algorithm is the ability to incorporate new data
  • Adaptive learning algorithms can be seen as advanced incremental learning algorithms that are able to adapt to evolution of the data generating process over time.

2.1. Setting and definitions

  • offline learning vs. online learning
    • In offline learning, all training data must be available at training time, and the model can be used for prediction only after training has finished
    • Online algorithms process data sequentially; the model does not have the full training set at the start and is updated continuously as training data arrives
  • incremental algorithms
    • Less restricted than online algorithms: the order of the data may be random rather than strictly sequential
    • Process input data one-by-one or batch-by-batch and update the model accordingly
  • streaming algorithms
    • online algorithms for processing high-speed continuous flows of data
    • data is processed sequentially
    • the algorithm uses limited memory and limited processing time
  • definition of concept drift
    • can be divided into two kinds: virtual drift and real concept drift
      • Real concept drift refers to changes in p(y|X)
      • Virtual drift happens if the distribution of the incoming data changes (i.e., p(X) changes) without affecting p(y|X)

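The two kinds of drift can be made concrete with a toy generator. This is a minimal sketch, not from the paper: the function name `sample`, the drift time 500, and the threshold rule are all hypothetical choices made for illustration. Under "virtual" drift only p(X) shifts while the labeling rule stays fixed; under "real" drift the labeling rule p(y|X) itself changes.

```python
import random

def sample(t, drift="real"):
    """Draw one (x, y) pair at time t from a hypothetical synthetic stream.

    Before t=500 the stream is stationary. Afterwards:
      - "virtual" drift shifts p(X) only (labels still follow the old rule),
      - "real" drift changes p(y|X) (the labeling rule itself flips).
    """
    if drift == "virtual" and t >= 500:
        x = random.uniform(2.0, 4.0)   # p(X) moves to a new region
    else:
        x = random.uniform(0.0, 2.0)
    if drift == "real" and t >= 500:
        y = int(x <= 1.0)              # p(y|X) changes: decision rule flips
    else:
        y = int(x > 1.0)               # original labeling rule
    return x, y
```

Note that under virtual drift a model trained on the old region may still be consistent with p(y|X), but its error estimate on the new p(X) can change.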
2.2. Changes in data over time

Changes fall into four types; note the distinction between a change and an outlier.

2.4. Online adaptive learning procedure

The online adaptive learning procedure consists of three steps:
(1) Predict
  When a new example $X_t$ arrives, a prediction $\hat{y}_t$ is made using the current model $L_t$.
(2) Diagnose
  After some time we receive the true label $y_t$ and can estimate the loss as $f(\hat{y}_t, y_t)$.
(3) Update
  We can use the example $(X_t, y_t)$ to update the model and obtain $L_{t+1}$.

Figure 3 shows a generic schema for an online adaptive learning algorithm.

  • The Memory module determines which data is presented to the learning module, and in what form.
  • The Loss estimation module tracks the algorithm's performance and, when necessary, signals the Change detection module so that the model can be updated.
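The predict/diagnose/update cycle above can be sketched with a deliberately simple learner. This is a toy 1-D threshold model of my own, not anything from the survey: `OnlineLearner`, its `step` parameter, and the nudging rule are all hypothetical, chosen only to make the three steps visible in code.

```python
class OnlineLearner:
    """Toy predict/diagnose/update loop for a hypothetical 1-D threshold model.

    The threshold plays the role of the model L_t: it predicts, the 0/1 loss
    on the revealed label is diagnosed, and the model becomes L_{t+1}.
    """

    def __init__(self, threshold=0.5, step=0.05):
        self.threshold = threshold  # the "model" L_t
        self.step = step

    def predict(self, x):                     # (1) Predict
        return int(x > self.threshold)

    def loss(self, y_hat, y):                 # (2) Diagnose: 0/1 loss
        return int(y_hat != y)

    def update(self, x, y):                   # (3) Update -> L_{t+1}
        y_hat = self.predict(x)
        if self.loss(y_hat, y):
            # nudge the threshold in the direction that corrects the mistake
            self.threshold += self.step if y == 0 else -self.step
        return y_hat
```

In a real system the diagnose step may be delayed, since the true label $y_t$ often arrives some time after the prediction is made.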

2.5. Illustrative applications

In recommender systems, both items and users change over time.

  • item-side effects: changing product positioning and popularity
    • product popularity sometimes follows seasonal patterns
  • user-side effects: changing user preferences

3. TAXONOMY OF METHODS FOR CONCEPT DRIFT ADAPTATION

The discussion is organized around four modules of adaptive learning algorithms that were identified in Figure 3: memory, change detection, learning, and loss estimation.
Figure 4 shows an overview of this section.

3.1. Memory


Learning under concept drift requires not only updating the model with new information, but also letting the model forget old information.

There are two forms of memory:

  • short term memory: represented as data
  • long term memory: represented as a generalization of data - the model

This subsection analyzes short term memory along two dimensions.

3.1.1. Data Management

ASSUMPTION:
The most recent data is the most informative for the current prediction.

  1. Single Example
  • only a single example is stored in memory
  • since such online learning algorithms have no explicit forgetting mechanism, they can cope with slow changes but struggle with abrupt changes
  2. Multiple Examples
  • maintain a predictive model consistent with a set of recent examples
  • many algorithms, such as FLORA, use a training window to update the model
    • sliding windows of a fixed size
      • the oldest data is discarded
      • commonly used as a baseline in evaluations
    • sliding windows of variable size
      • the number of examples in the window varies over time, typically governed by a change detector
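A fixed-size sliding window maps directly onto Python's `collections.deque`. This is a minimal sketch assuming examples are stored as-is; the class name `FixedWindowMemory` is hypothetical.

```python
from collections import deque

class FixedWindowMemory:
    """Sliding window of fixed size j: once the window is full, adding a
    new example automatically discards the oldest one (FIFO forgetting)."""

    def __init__(self, j):
        self.window = deque(maxlen=j)

    def add(self, example):
        self.window.append(example)   # deque drops the oldest when full

    def contents(self):
        return list(self.window)
```

A variable-size window would instead grow and shrink this buffer based on signals from a change detector.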
3.1.2. Forgetting Mechanism

The most common way to cope with an unknown, dynamically changing environment is forgetting the outdated observations.
There is a tradeoff between reactivity and robustness to noise: the faster the model forgets, the faster it reacts, but the higher the risk of fitting noise.

  1. Abrupt Forgetting
  • Abrupt forgetting, or partial memory, refers to mechanisms in which a given observation is either inside or outside the training window.
  • The literature contains many window models, which fall into two classes
    • sequence based
      • sliding windows of size j
      • landmark windows (store all data after a given timestamp)
    • timestamp based
      • timestamp-based window of size t (store the data from the most recent t time units)
  • Besides windowing, there are also sampling approaches
    • goal: summarize the underlying characteristics of a data stream uniformly
  2. Gradual Forgetting
  • In gradual forgetting, no example is ever completely discarded from memory
  • Examples in memory are associated with weights that reflect their age.
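Age-based weighting is often realized with exponential decay. The sketch below is one common choice, not the paper's specific scheme; the function name `age_weights` and the `decay` parameter are hypothetical.

```python
import math

def age_weights(n, decay=0.1):
    """Exponentially decayed weights for n examples ordered oldest-first.

    The newest example gets weight 1.0; an example that is `age` steps old
    gets exp(-decay * age). No example is ever dropped outright, matching
    the gradual-forgetting idea.
    """
    return [math.exp(-decay * (n - 1 - i)) for i in range(n)]
```

A weighted learner would then multiply each example's loss contribution by its weight, so old examples fade smoothly instead of disappearing at a window boundary.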

3.2. Change Detection

Change detection module characterizes and quantifies concept drift by identifying change points or small time intervals during which changes occur.

Online learning is already adaptive by nature, but explicit change detection additionally provides information about the dynamics of the data-generating process. A common approach is to monitor performance indicators (e.g., accuracy, recall and precision) to decide whether a change has occurred.

3.2.1. Detectors based on Sequential Analysis
  • Sequential Probability Ratio Test (SPRT)
    • basis for some change detection algorithms
  • The cumulative sum (CUSUM)
  • The Page-Hinkley test (PH)
    • sequential adaptation of the detection of an abrupt change
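The Page-Hinkley test can be sketched in a few lines. This is a simplified version assuming we monitor a signal in which drift shows up as an increase in the mean (e.g., an error rate); the parameter names `delta` (tolerated change magnitude) and `lam` (alarm threshold) follow common usage but the exact formulation varies across implementations.

```python
class PageHinkley:
    """Minimal Page-Hinkley sketch for detecting an increase in the mean
    of a monitored signal, such as a model's error rate."""

    def __init__(self, delta=0.005, lam=1.0):
        self.delta, self.lam = delta, lam
        self.mean, self.n = 0.0, 0
        self.cum, self.cum_min = 0.0, 0.0

    def add(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n      # running mean of the signal
        self.cum += x - self.mean - self.delta     # cumulative deviation
        self.cum_min = min(self.cum_min, self.cum) # smallest value seen so far
        return self.cum - self.cum_min > self.lam  # True => drift alarm
```

Larger `lam` means fewer false alarms but slower detection, mirroring the reactivity/robustness tradeoff discussed for forgetting.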
3.2.2. Detectors based on Statistical Process Control

Control charts, or Statistical Process Control (SPC), are standard statistical techniques to monitor and control the quality of a product during continuous manufacturing.

SPC-based systems have three states:
(1)In-Control
(2)Out-of-Control
(3)Warning

A classic algorithm: the exponentially weighted moving average (EWMA).
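The three SPC states can be illustrated with a toy EWMA chart. This sketch is hypothetical: a proper EWMA chart scales its control limits by sqrt(lam/(2-lam)), while here fixed 2-sigma/3-sigma thresholds are used purely to show the in-control / warning / out-of-control logic.

```python
class EWMAChart:
    """Toy EWMA control chart. The statistic z_t = (1-l)*z_{t-1} + l*x_t is
    compared against warning and out-of-control limits around a target mean,
    mimicking the three SPC states (simplified fixed limits, see lead-in)."""

    def __init__(self, target, sigma, lam=0.2, warn=2.0, out=3.0):
        self.z = target
        self.target, self.sigma, self.lam = target, sigma, lam
        self.warn, self.out = warn, out

    def state(self, x):
        self.z = (1 - self.lam) * self.z + self.lam * x   # EWMA update
        dev = abs(self.z - self.target) / self.sigma      # deviation in sigmas
        if dev > self.out:
            return "out-of-control"
        if dev > self.warn:
            return "warning"
        return "in-control"
```

In drift detection, the Warning state is typically used to start buffering recent examples, so a new model can be trained quickly if Out-of-Control is reached.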

3.2.3. Monitoring distributions on two different time-windows

These methods typically use a fixed reference window that summarizes the past information, and a sliding detection window over the most recent examples. Distributions over these two windows are compared using statistical tests, with the null hypothesis stating that the distributions are equal. If the null hypothesis is rejected, a change is declared at the start of the detection window.

Classic algorithms:

  • VFDTc
  • [Vorburger and Bernstein 2006]
    • an entropy-based metric to measure the distribution inequality between two sliding windows
  • [Sebastião and Gama 2007]
    • uses the Kullback-Leibler (KL) divergence to measure the distance between the probability distributions of two different windows (old and recent) to detect possible changes.
  • [Bach and Maloof 2008]
    • uses two models: stable and reactive
  • ADWIN

PROs and CONs of detection methods based on two windows:

  • PROs
    • more precise localization of the change point
  • CONs
    • higher memory requirements compared to sequential detectors
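A two-window comparison can be sketched with a Hoeffding-style bound on the difference of window means, in the spirit of ADWIN but deliberately simplified; the function name `windows_differ` and the use of a harmonic-mean sample size are my own assumptions, not the paper's exact tests.

```python
import math

def windows_differ(reference, detection, delta=0.05):
    """Compare the means of a reference window and a detection window of
    values in [0, 1] (ADWIN-flavored simplification, see lead-in).

    Returns True when the null hypothesis "both windows have equal mean"
    is rejected at confidence parameter delta.
    """
    n0, n1 = len(reference), len(detection)
    mean_diff = abs(sum(reference) / n0 - sum(detection) / n1)
    m = 1.0 / (1.0 / n0 + 1.0 / n1)                   # harmonic-mean sample size
    eps = math.sqrt(math.log(4.0 / delta) / (2.0 * m))  # Hoeffding-style cut
    return mean_diff > eps
```

When monitoring a 0/1 error stream, `reference` would hold past per-example losses and `detection` the most recent ones; a True result localizes the change near the start of the detection window.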
3.2.4. Contextual approaches

Three mechanisms: staleness, penalization and overall accuracy

3.3. Learning

The Learning component refers to the techniques and mechanisms for generalizing from examples and updating the predictive models from evolving data.

3.3.1. Learning Mode
  1. Retraining - discards the current model and builds a new model from scratch
  • needs a data buffer to be stored in memory
  • At the beginning a model is trained with all the available data.
  • Next, whenever new data arrives, the previous model is discarded, the new data is merged with the previous data, and a new model is learned on this data.
  2. Incremental - adaptation updates of the model
  • Incremental approaches update the model with the most recent data; they can be divided into three kinds of algorithms:
    • Incremental algorithms process input examples one-by-one and update the sufficient statistics stored in the model after receiving each example.
    • The Online learning mode updates the current model using the most recent data.
    • In streaming algorithms, examples are processed sequentially using limited memory.
3.3.2. Adaptation Methods
  1. Blind adaptation strategies
  • Blind adaptation updates the model without any explicit detection of changes
  • A commonly used technique is periodically retraining the model over a sliding window of fixed size
  • A special case is incremental and online learning, where the model evolves with the data
  • Main drawback: slow reaction to concept drift in the data
  2. Informed adaptation strategies
  • Informed strategies are reactive: they act depending on whether a trigger has been flagged
  • The reaction to a drift signal can take two forms:
    • Global Replacement
      • requires full reconstruction of the model
      • the most radical reaction to a drift
    • Local Replacement
      • In many cases changes occur only in some regions of the data space.
      • Granular models, such as decision rules or decision trees, can adapt parts of the model.
      • Classic algorithm: CVFDT (decision trees)
3.3.3. Model Management
  1. Ensemble learning
  • keep in memory an ensemble of multiple models that make a combined prediction
  • ASSUMPTION
    • during a change, data is generated from a mixture distribution, namely a weighted combination of distributions characterizing the target concepts
      • The weights change over time.
    • each individual model in the ensemble models one of the distributions
  • ensemble methods for dynamically changing data can be divided into three groups:
    • dynamic combination
      • by modifying the combination rule
    • continuous update of the learners
      • learners are either retrained in a batch mode or updated online using new data
    • structural update
      • new learners are added and inefficient ones are removed
  • Classic algorithm: the Dynamic Weighted Majority algorithm (DWM)
    • DWM maintains an ensemble of predictive models, each with an associated weight.
    • DWM dynamically creates and deletes experts in response to changes in performance.
  2. Reoccurring concept management
  • FLORA3 is the first adaptive learning technique for tasks where concepts may reoccur over time.
  • [Gama and Kosina 2011]
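The DWM weight dynamics can be sketched as follows. This is a simplified toy version: real DWM wraps arbitrary online learners and updates on a period p, whereas here each "expert" is a fixed 1-D threshold rule so the weighting, pruning, and expert-creation steps stay readable. All names and parameter defaults are hypothetical.

```python
class DynamicWeightedMajority:
    """Simplified DWM sketch with hypothetical 1-D threshold experts.

    Each expert is a pair (threshold, weight). An expert's weight is
    multiplied by beta whenever it is individually wrong; near-zero
    experts are pruned; a fresh expert is added after an ensemble mistake.
    """

    def __init__(self, beta=0.5, prune=0.01):
        self.beta, self.prune = beta, prune
        self.experts = [(0.5, 1.0)]   # initial (threshold, weight)

    def _vote(self, x):
        # weighted vote: predict 1 if the weighted score is positive
        score = sum(w if x > t else -w for t, w in self.experts)
        return int(score > 0)

    def learn(self, x, y):
        y_hat = self._vote(x)
        # demote experts that were individually wrong on this example
        self.experts = [(t, w * self.beta if int(x > t) != y else w)
                        for t, w in self.experts]
        # prune near-zero experts; add a new one after an ensemble mistake
        self.experts = [(t, w) for t, w in self.experts if w > self.prune]
        if y_hat != y or not self.experts:
            self.experts.append((x, 1.0))   # new expert anchored at x
        return y_hat
```

After a drift, experts fitted to the old concept lose weight quickly, while newly created experts take over the combined prediction.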

3.4. Loss estimation

Supervised adaptive systems rely on loss estimation based on environment feedback.

3.5. Discussion

Supervised adaptive learning algorithms rely on immediate arrival of feedback.
However, feedback can come with an uncontrollable delay, be unreliable, biased or costly.

4. EVALUATION

4.1. Performance evaluation metrics

  1. Performance evaluation metrics may be selected from the traditional accuracy measures, such as
  • precision and recall or their weighted average in retrieval tasks,
  • sensitivity and specificity or their weighted average in medical diagnostics,
  • mean absolute scaled errors in regression or time-series prediction tasks,
  • root mean square error in recommender systems.
  2. It is important to consider appropriate reference points or baseline approaches in particular settings.
  3. In addition, taking into account practical considerations of the streaming settings, we may consider the following measures:
  • A measure for the computational cost of the mining process
  • A statistic that takes class imbalance into account
  4. Besides evaluating the performance of the learning strategy, we may want to assess the accuracy of change detection separately for those strategies that employ explicit drift detection as part of the concept drift handling strategy.
  • Probability of true change detection
  • Probability of false alarms
    • Instead of reporting the commonly used false positive rate for change detections, in the streaming settings it is more convenient to use the inverse of the time to detection, or the average run length
  • Delay of detection

4.2. Experimental design

Problem: the commonly used cross-validation is not applicable to streaming data, because it would mix the temporal order of the data.
One solution involves taking snapshots at different times during the induction of a model to see how much the model improves.

4.2.1. Evaluation on time-ordered data
  1. Holdout
  • Since cross-validation is too time-consuming, a single holdout set is used to measure performance
  • When testing a model at time $t$, the holdout set represents exactly the same context at that time $t$.
    • Therefore, the loss estimated on the holdout set is an unbiased estimator
    • However, it is hard to know what the concept at time $t$ actually is, so obtaining such a holdout set is difficult
  2. Interleaved Test-Then-Train or Prequential
  • Each individual instance can be used to test the model before it is used for training, and from this the accuracy can be incrementally updated.
  • This approach makes maximum use of the available data
  3. Controlled permutations
  • Averaging the accuracy over time may mask the adaptation properties of adaptive learning algorithms
    • Even the prequential evaluation may produce results biased towards the fixed order of data in a sequence, as it runs only one test in a fixed order of data.
  • To reduce this risk, controlled permutations evaluation [Zliobaite 2011b] runs multiple tests with randomized copies of a data stream
    • Randomization aims at keeping instances that were originally nearby in time close together.
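The prequential (test-then-train) scheme from item 2 above can be sketched directly. This is a minimal version assuming a model object with hypothetical `predict(x)` and `update(x, y)` methods; the function name `prequential_accuracy` is my own.

```python
def prequential_accuracy(stream, model):
    """Interleaved Test-Then-Train: each example first tests the current
    model and only then trains it, so every instance serves both roles
    and the accuracy estimate is updated incrementally.

    `stream` yields (x, y) pairs; `model` exposes predict(x) and update(x, y).
    """
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)   # test first ...
        model.update(x, y)                      # ... then train
        total += 1
    return correct / total if total else 0.0
```

A fading factor or sliding window over the per-example losses is often added so that the estimate reflects recent performance rather than the whole history.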
4.2.2. Cross-validation with aligned series of data

We want the adaptive learning technique to learn not a single adaptive model for the data stream of one individual object, but multiple models - one adaptive model per object.
Since different data streams can be alike, i.e. share the same feature space or be of the same nature, it is reasonable to use these different data streams to evaluate the performance of a model across multiple objects.

4.2.3. Statistical significance
  • McNemar test
    • used in stream mining
    • assessing the differences in performance of two classifiers
  • Nemenyi test
    • comparing more than two classifiers
  • Dunnett test
    • all classifiers are compared to a control classifier