Machine Learning Application Areas (2): Concept Drift

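This article examines concept drift, the phenomenon in which the underlying data distribution of a data stream changes over time. It discusses the impact of concept drift on machine learning and data mining and introduces two families of detection methods, supervised and unsupervised. It then proposes a new detection method based on semantic folding, which identifies concept drift by comparing the semantic similarity of different concepts.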


1. Concept drift

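  Background: Concept drift refers to unforeseeable changes over time in the underlying data distribution of a data stream. These changes cause an existing classifier to misclassify or a decision system to make incorrect decisions. It is common in recommender systems, finance, and decision support.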

    Concept drift refers to unforeseeable changes in the underlying data distribution of data streams over time. 

  Definition: Concept drift in machine learning and data mining refers to the change in the relationships between input and output data in the underlying problem over time. (https://machinelearningmastery.com/gentle-introduction-concept-drift-machine-learning/)

  My understanding: the target function changes unpredictably over time. For example, before the drift input(x1) --> target(x1); after concept drift, input(x1) --> target(x2).
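To make the input/target picture concrete, here is a minimal Python sketch. The threshold rules `old_concept` and `new_concept` and the helper `accuracy` are hypothetical, invented only for illustration: a model fit to the old concept is scored on the same inputs before and after the mapping from input to target changes.

```python
def old_concept(x):
    """Target function before the drift (hypothetical rule)."""
    return int(x > 5)


def new_concept(x):
    """Target function after the drift: same inputs, different targets."""
    return int(x > 2)


def accuracy(model, concept, inputs):
    """Fraction of inputs on which the model agrees with the current concept."""
    return sum(model(x) == concept(x) for x in inputs) / len(inputs)


inputs = list(range(10))
model = old_concept  # stands in for a classifier trained before the drift

acc_before = accuracy(model, old_concept, inputs)
acc_after = accuracy(model, new_concept, inputs)
print(acc_before, acc_after)
```

The inputs themselves do not change here. Only the relationship between input and target moves, so the stale model starts to disagree with the new concept on the inputs 3 to 5.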

2. Concept drift detection methods

  2.1 Supervised learning method

Supervised methods usually depend on the underlying data distribution and on labels to compute statistics such as the classification error rate, relative entropy, or the Linear Four Rates (true positive rate, true negative rate, positive predictive value, and negative predictive value). Although these methods can achieve high accuracy, they rely heavily on the distribution of the underlying data and on labeled data.
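As an illustration of the error-rate approach, the sketch below follows the general idea of the well-known Drift Detection Method (DDM): track the running error rate p and its standard deviation s, remember the lowest p + s seen so far, and signal drift when the current p + s rises well above that minimum. This is a simplified sketch, not the method of any reference listed in this post; the class name and the `min_samples` and `drift_sigmas` parameters are assumptions chosen for the example.

```python
import math


class ErrorRateDriftDetector:
    """Simplified DDM-style detector fed with per-sample error flags."""

    def __init__(self, min_samples=30, drift_sigmas=3.0):
        self.min_samples = min_samples    # warm-up length (assumed value)
        self.drift_sigmas = drift_sigmas  # alarm level in standard deviations
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, is_error):
        """Consume one prediction outcome; return True if drift is signalled."""
        self.n += 1
        self.errors += int(is_error)
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        if self.n < self.min_samples:
            return False
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s  # best operating point so far
        if p + s > self.p_min + self.drift_sigmas * self.s_min:
            self.reset()                   # start over on the new concept
            return True
        return False


def first_drift_index(error_stream, detector):
    """Index of the first sample at which the detector signals drift."""
    for i, is_error in enumerate(error_stream):
        if detector.update(is_error):
            return i
    return None


# Synthetic stream: 10% errors for 500 samples, then 50% errors.
stream = [i % 10 == 0 for i in range(500)] + [i % 2 == 0 for i in range(500, 1000)]
drift_at = first_drift_index(stream, ErrorRateDriftDetector())
print(drift_at)
```

Every call to `update` needs the true label of the sample to know whether the prediction was an error, which is exactly the dependence on labeled data noted above.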

  2.2 Unsupervised learning method

Unsupervised methods usually compute the difference between adjacent data blocks to determine whether concept drift has occurred, for example using the distance in a topic feature space, the similarity of time-series features, or the Fuzzy Competence Model.

Although these methods need no prior knowledge of the underlying data and can report when, how, and where concept drift occurs, they do not use semantic information in detection. A small number of samples also limits the applicability of unsupervised methods.
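The adjacent-block comparison can be sketched in a few lines of Python. The example uses the total variation distance between the empirical distributions of two neighbouring blocks. That distance is a simple stand-in chosen for this sketch, not one of the measures named above, and the function names, block size, and threshold are assumptions.

```python
from collections import Counter


def total_variation(block_a, block_b):
    """Total variation distance between two empirical distributions (0 to 1)."""
    count_a, count_b = Counter(block_a), Counter(block_b)
    keys = set(count_a) | set(count_b)
    return 0.5 * sum(
        abs(count_a[k] / len(block_a) - count_b[k] / len(block_b)) for k in keys
    )


def detect_block_drift(stream, block_size, threshold):
    """Return the start offsets of blocks that differ strongly from their predecessor."""
    blocks = [
        stream[i:i + block_size]
        for i in range(0, len(stream) - block_size + 1, block_size)
    ]
    return [
        (i + 1) * block_size
        for i in range(len(blocks) - 1)
        if total_variation(blocks[i], blocks[i + 1]) > threshold
    ]


# Unlabeled categorical stream whose symbol frequencies change at offset 200.
stream = ["a", "a", "b", "c"] * 50 + ["c", "c", "d", "a"] * 50
drift_offsets = detect_block_drift(stream, block_size=100, threshold=0.3)
print(drift_offsets)
```

No labels are involved, but each block must hold enough samples for its empirical distribution to be meaningful, which is the small-sample limitation mentioned above.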

3. Proposed method

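Core idea: use semantic information and an algorithm to represent different concepts, then compare their similarity. If concepts with different names turn out to be similar, concept drift has occurred between them, because their semantic essence has not changed. For example, 计算机 and 电脑 are two Chinese words for "computer": the name has drifted, but both still refer to the same thing.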

To overcome the above limitations, I propose a concept drift detection method based on semantic folding. Semantic folding can represent the semantic information of the underlying context data by generating a 128-bit or 256-bit hash vector. This should make it better suited to detecting concept drift than a topic feature space or maximum likelihood estimation. The method has the following steps:

(1) Extract an initial semantic folding vector v1 from the original underlying data.
(2) Generate a new semantic folding vector v2 when new samples become available.
(3) Compute the similarity or distance between the two vectors v1 and v2.
(4) Report concept drift when the two vectors differ significantly.
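The four steps can be sketched as follows. No semantic folding encoder is specified in this post, so `fingerprint` below is a hypothetical stand-in: a SimHash-style 128-bit fingerprint built from SHA-256 token hashes. It only reflects token overlap and will not place synonyms close together the way a real semantic folding encoder is meant to. The function names and the 0.25 threshold are also assumptions.

```python
import hashlib

BITS = 128  # fingerprint width, matching the 128-bit vectors mentioned above


def fingerprint(text):
    """Hypothetical stand-in encoder: SimHash-style binary vector of a text."""
    votes = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()[:BITS // 8]
        value = int.from_bytes(digest, "big")
        for bit in range(BITS):
            votes[bit] += 1 if (value >> bit) & 1 else -1
    # A bit is set when more tokens voted 1 than 0 at that position.
    return [1 if v > 0 else 0 for v in votes]


def hamming_distance(v1, v2):
    """Fraction of bit positions at which the two vectors differ."""
    return sum(a != b for a, b in zip(v1, v2)) / len(v1)


def drift_detected(old_text, new_text, threshold=0.25):
    v1 = fingerprint(old_text)                    # step (1)
    v2 = fingerprint(new_text)                    # step (2)
    return hamming_distance(v1, v2) > threshold   # steps (3) and (4)


old_data = "stock price market trading shares investor dividend"
new_data = "football match goal player stadium referee league"
print(drift_detected(old_data, old_data), drift_detected(old_data, new_data))
```

Swapping `fingerprint` for an actual semantic folding encoder would leave steps (3) and (4) unchanged, since they only need two binary vectors of equal length.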

4. References

[1] FAN D, JIE L, GUANGQUAN Z, et al. Active fuzzy weighting ensemble for dealing with concept drift[J]. International Journal of Computational Intelligence Systems, 2018, 11: 438-450.

[2] FAN D, GUANGQUAN Z, JIE L, et al. Fuzzy competence model drift detection for data-driven decision support systems (DSSs)[J]. Knowledge-Based Systems, doi: 10.1016/j.knosys.2017.08.018.

[3] GUANG C, XUEGANG H, YUHONG Z. Semantic-based concept drift detection algorithm for data stream[J]. Computer Engineering, 2018, 44(2): 24-30.

[4] RODOLFO C, LEANDRO M, ADRIANO O. FEDD: Feature extraction for explicit concept drift detection in time series[C]. 2016 International Joint Conference on Neural Networks (IJCNN), 24-29/07/2016.

[5] SHUJIAN Y, ABRAHAM Z. Concept drift detection with hierarchical hypothesis testing[C]. 2017 SIAM International conference on data mining, 2017.
