1. Concept drift
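Background: Concept drift refers to unforeseeable changes over time in the underlying data distribution of a data stream. These changes make a previously trained classifier inaccurate or prevent a decision system from deciding correctly. It is common in recommender systems, finance, and decision support.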
Definition: Concept drift in machine learning and data mining refers to the change in the relationships between input and output data in the underlying problem over time. (https://machinelearningmastery.com/gentle-introduction-concept-drift-machine-learning/)
My understanding: the target function changes unpredictably over time. For example, before drift: input(x1) --> target(x1); after concept drift: input(x1) --> target(x2).
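A minimal sketch of this reading of concept drift, using a toy setup that is purely an assumption for illustration: the inputs stay the same, but the labelling rule (`old_concept` vs. `new_concept`, both hypothetical names) changes, so a model that fits the old rule perfectly fails on the new one.

```python
import random

def old_concept(x):
    # Target before drift: label 1 when x is above 0.5.
    return int(x > 0.5)

def new_concept(x):
    # Target after drift: the same inputs now receive the opposite label.
    return int(x <= 0.5)

def accuracy(model, concept, xs):
    # Fraction of inputs on which the model agrees with the current concept.
    return sum(model(x) == concept(x) for x in xs) / len(xs)

rng = random.Random(0)
xs = [rng.random() for _ in range(1000)]

# A "classifier" that has learned the old concept exactly.
model = old_concept

acc_before = accuracy(model, old_concept, xs)
acc_after = accuracy(model, new_concept, xs)
print(acc_before)  # 1.0
print(acc_after)   # 0.0
```

The input distribution is identical in both evaluations; only the input-to-target relationship changed, which is what the definition above describes.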
2. Concept drift detection methods
2.1 Supervised learning method
Supervised methods usually rely on the underlying data distribution and on labeled data to compute statistics such as the classification error rate, relative entropy, or the linear four rates (true positive rate, true negative rate, positive predictive value, and negative predictive value). Although these methods can achieve high accuracy, they depend heavily on the distribution of the underlying data and on the availability of labeled data.
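To make the four rates concrete, here is a rough sketch that computes them for a reference window and a recent window of labeled predictions. The function names and the fixed `tolerance` are assumptions made for illustration; a real detector would use a statistical test rather than a hard-coded gap.

```python
def four_rates(y_true, y_pred):
    """Return TPR, TNR, PPV and NPV for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    def ratio(num, den):
        # Guard against empty denominators in small windows.
        return num / den if den else 0.0

    return {
        "tpr": ratio(tp, tp + fn),  # true positive rate
        "tnr": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),  # positive predictive value
        "npv": ratio(tn, tn + fn),  # negative predictive value
    }

def rates_drifted(reference, recent, tolerance=0.1):
    # Hypothetical rule: flag drift if any rate moved by more than `tolerance`.
    return any(abs(reference[k] - recent[k]) > tolerance for k in reference)

labels = [1, 1, 0, 0, 1, 0, 1, 0]
reference = four_rates(labels, [1, 1, 0, 0, 1, 0, 1, 0])  # all predictions correct
recent = four_rates(labels, [1, 0, 0, 1, 0, 0, 1, 1])     # half of each class wrong
print(recent)                            # every rate is 0.5
print(rates_drifted(reference, recent))  # True
```

Note that every rate needs the true labels, which is exactly the dependence on labeled data mentioned above.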
2.2 Unsupervised learning method
Unsupervised methods usually compute the difference between adjacent data blocks to decide whether concept drift has occurred, for example using the distance in topic feature space, the similarity of time-series features, or the Fuzzy Competence Model.
Although these methods need no prior knowledge of the underlying data and can report when, how, and where concept drift occurs, they do not use semantic information when detecting drift. A small number of samples also limits the applicability of unsupervised methods.
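As a simple illustration of comparing adjacent data blocks without labels, the sketch below histograms two blocks of one-dimensional values over a shared range and measures the total variation distance between them. This is a generic stand-in, not one of the methods named above; `bins` and `threshold` are hypothetical parameters.

```python
def histogram(block, bins, lo, hi):
    """Normalized histogram of `block` over [lo, hi] with `bins` equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in block:
        # Clamp so that x == hi falls into the last bin.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(block) for c in counts]

def total_variation(p, q):
    # Half the L1 distance between two discrete distributions, in [0, 1].
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def blocks_differ(prev_block, next_block, bins=10, threshold=0.3):
    lo = min(min(prev_block), min(next_block))
    hi = max(max(prev_block), max(next_block))
    if hi == lo:
        return False  # all values identical, nothing to compare
    p = histogram(prev_block, bins, lo, hi)
    q = histogram(next_block, bins, lo, hi)
    return total_variation(p, q) > threshold

block_a = [0.1, 0.2, 0.3, 0.4]
block_c = [5.1, 5.2, 5.3, 5.4]
print(blocks_differ(block_a, block_a))  # False
print(blocks_differ(block_a, block_c))  # True
```

With only a handful of samples per block the histograms become noisy, which reflects the small-sample limitation noted above.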
3. Proposed method
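Core idea: use semantic information and an algorithm to represent different concepts, then compare their similarity. If concepts with different names turn out to be similar, concept drift has occurred between them, because their semantic essence has not changed. For example, 计算机 and 电脑 are two Chinese words for "computer": even if the concept has drifted from one name to the other, both essentially refer to the same thing.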
To overcome the above limitations, I propose a concept drift detection method based on semantic folding. Semantic folding can represent the semantic information and context of the underlying data by generating a 128- or 256-bit hash vector. This is expected to be more advantageous for detecting concept drift than topic feature space and maximum likelihood estimation. The steps of the method are as follows:
(1) Extract an initial semantic folding vector v1 from the original underlying data.
(2) Generate a new semantic folding vector v2 when new samples become available.
(3) Compute the similarity or distance between the two vectors v1 and v2.
(4) Declare that concept drift has occurred when the two vectors differ significantly.
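The four steps above can be sketched roughly as follows. This is not an implementation of semantic folding: the `fold` function is a SimHash-style placeholder that only captures token overlap, so it would treat synonyms such as 计算机 and 电脑 as different. All names and the `threshold` value are hypothetical; a real semantic folding encoder would replace `fold`.

```python
import hashlib

def fold(tokens, bits=128):
    """Placeholder encoder: SimHash-style bit vector of width `bits` (at most 256)."""
    if not 0 < bits <= 256:
        raise ValueError("bits must be between 1 and 256")
    votes = [0] * bits
    for token in tokens:
        # SHA-256 is used only as a stable 256-bit token hash.
        h = int.from_bytes(hashlib.sha256(token.encode("utf-8")).digest(), "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set when more tokens voted 1 than 0 at that position.
    return [1 if v > 0 else 0 for v in votes]

def normalized_hamming(v1, v2):
    # Step (3): fraction of bit positions where the two vectors disagree.
    return sum(a != b for a, b in zip(v1, v2)) / len(v1)

def drift_detected(v1, v2, threshold=0.25):
    # Step (4): hypothetical rule, drift if the vectors differ "significantly".
    return normalized_hamming(v1, v2) > threshold

# Step (1): initial vector from the original data.
v1 = fold("the stock price rose after the earnings report".split())
# Step (2): new vector once new samples arrive.
v2 = fold("the team won the match in extra time".split())

print(normalized_hamming(v1, v2))
print(drift_detected(v1, v2))
```

The threshold would need to be tuned on real data; how large a distance counts as "significant" is left open in the steps above.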
4. References
[1] FAN D, JIE L, GUANGQUAN Z, et al. Active fuzzy weighting ensemble for dealing with concept drift[J]. International journal of computational intelligence systems, 2018, 11: 438-450.
[2] FAN D, GUANGQUAN Z, JIE L, et al. Fuzzy competence model drift detection for data-driven decision support systems (DSSs)[J]. Knowledge-Based systems, doi: 10.1016/j.knosys.2017.08.018.
[3] GUANG C, XUEGANG H, YUHONG Z. Semantic-based concept drift detection algorithm for data stream[J]. Computer Engineering, 2018, 44(2): 24-30.
[4] RODOLFO C, LEANDRO M, ADRIANO O. FEDD: Feature extraction for explicit concept drift detection in time series[C]. 2016 International joint conference on neural networks (IJCNN), 24-29/07/2016.
[5] SHUJIAN Y, ABRAHAM Z. Concept drift detection with hierarchical hypothesis testing[C]. 2017 SIAM International conference on data mining, 2017.