[KDD 2023] WHEN: A Wavelet-DTW Hybrid Attention Network for Heterogeneous Time Series Analysis

Paper link: WHEN: A Wavelet-DTW Hybrid Attention Network for Heterogeneous Time Series Analysis | Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

The English here is typed entirely by hand, summarizing and paraphrasing the original paper. Some unavoidable spelling and grammar mistakes may slip in; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with caution.

Contents

1. Thoughts

2. Close Reading of the Paper

2.1. Abstract

2.2. Introduction

2.3. Related Work

2.4. Data-Dependent Wavelet Attention

2.4.1. Wavelet-based Frequency Analysis

2.4.2. Data-dependent Wavelet Attention for Dynamic Frequency Analysis

2.5. Dynamic Time Warping Attention

2.5.1. Dynamic Time Warping

2.5.2. Local Dynamic Time Warping Attentions

2.5.3. The Model Output Layer

2.6. Experiments

2.6.1. Task I: Time Series Classification

2.6.2. Task II: Time Series Forecasting

2.6.3. Exploratory Analysis

2.7. Conclusion

3. Background Notes

3.1. Dynamic Time Warping (DTW)

3.2. Nemenyi test

4. Reference


1. Thoughts

(1) I really dislike authors who write papers in inscrutable calligraphic fonts. First, some of those script letters are simply unrecognizable; second, formula OCR tends to misread them as other letters; and third, CSDN does not even support that kind of script font. Readers just want the analysis, so why be so fancy? Could the notation please stop squiggling around?

(2) Rant aside, it should not reflect on the paper itself. The authors migrated DTW, originally a classical machine learning tool, into deep learning, which is a nice move.

2. Close Reading of the Paper

2.1. Abstract

        ①Challenge in time series analysis: heterogeneity of data

        ②Two types of heterogeneity: intra-sequence nonstationarity (which I abbreviate as ISN) and inter-sequence asynchronism (which I abbreviate as ISA) (the authors always spell out the full names, which are far too long to keep typing by hand)

        ③Solution: the proposed WHEN framework, which contains two attention mechanisms

asynchronism  n. lack of simultaneity; asynchrony

2.2. Introduction

        ①Task types for time series analysis (TSA):  time series classification (TSC) and time series forecasting (TSF)

        ②ISN arises from inherently heterogeneous statistics such as mean, variance, and frequency components; ISA is caused by heterogeneous sampling rates or phase perturbations. Both are traditionally handled by the Dynamic Time Warping (DTW) algorithm, which is not a deep learning method

        ③Taking ECG as an example, intra-sequence nonstationarity and inter-sequence asynchronism look like:

        ④Limitation of CNNs and RNNs on these two heterogeneous characteristics: the structures of deep models are repetitive (the authors argue that identical, repeated convolution kernels cannot capture heterogeneous patterns, though many modern models no longer use plain convolutions anyway)

        ⑤Framework of Wavelet-DTW Hybrid attEntion Networks (WHEN):

2.3. Related Work

(1)Heterogeneous Time Series Analysis

        ①Traditional way: manually transform non-stationary time series into stationary ones

        ②The section lists other deep learning methods

(2)Time Series Classification (TSC)

        ①3 types of methods: distance based methods, feature based methods and ensemble methods

(3)Time Series Forecasting (TSF)

        ①TSF relies on autoregressive correlation

2.4. Data-Dependent Wavelet Attention

2.4.1. Wavelet-based Frequency Analysis

        ①Define the wavelet basis, generated from a mother wavelet \psi by dilation and shifting:

\psi_{\alpha,\tau}(t)=\frac{1}{\sqrt{\alpha}}\psi\left(\frac{t-\tau}{\alpha}\right)

where \alpha \in \mathbb{R}^+ denotes the dilation scale (controlling the frequency band) and \tau \in \mathbb{R} is the shift

        ②For a sequential signal f\left ( t \right ) and basis \psi_{\alpha,\tau}(t), the wavelet transform (WT) is:

r_{\alpha,\tau}=\int_{-\infty}^{+\infty}f(t) \frac{1}{\sqrt{\alpha}}\psi\left(\frac{t-\tau}{\alpha}\right) \mathrm{d}t

where r_{\alpha,\tau} is the intensity of the component in frequency band \frac{1}{\alpha} at location \tau
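To make the transform concrete, here is a minimal NumPy sketch (my own illustration, not the paper's code) that discretizes the integral above; the Mexican-hat mother wavelet and all names are assumptions:

```python
import numpy as np

def mexican_hat(t):
    """One common choice of mother wavelet psi (Mexican hat / Ricker)."""
    return (1 - t**2) * np.exp(-t**2 / 2)

def wavelet_coefficient(f, alpha, tau, dt=1.0):
    """Riemann-sum approximation of r_{alpha,tau}: correlate the sampled
    signal f with the dilated and shifted wavelet psi((t - tau)/alpha)."""
    t = np.arange(len(f)) * dt
    psi = mexican_hat((t - tau) / alpha) / np.sqrt(alpha)
    return np.sum(f * psi) * dt

# Larger alpha -> wider wavelet -> lower frequency band 1/alpha
f = np.sin(2 * np.pi * 0.05 * np.arange(200))
r = wavelet_coefficient(f, alpha=10.0, tau=100.0)
```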

2.4.2. Data-dependent Wavelet Attention for Dynamic Frequency Analysis

        ①They design a data-dependent wavelet attention mechanism (WaveAtt):

        ②The original input: a multivariate time series X=(x_1,x_2,\ldots,x_i,\ldots,x_I)^\top, where x_i \in \mathbb{R}^{K_X} is a K_X-dimensional vector

        ③They encode X with a Bidirectional Long Short-Term Memory (BiLSTM) network into a hidden sequence S=\left ( s_1,...,s_i,...,s_I \right ), capturing both forward and backward temporal dependencies
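A minimal PyTorch sketch of this encoding step (the dimension numbers are illustrative assumptions, not values from the paper):

```python
import torch
import torch.nn as nn

K_X, hidden, I = 8, 64, 100    # input dim K_X, hidden size, series length (illustrative)
encoder = nn.LSTM(input_size=K_X, hidden_size=hidden,
                  batch_first=True, bidirectional=True)

X = torch.randn(32, I, K_X)    # a batch of 32 multivariate time series
S, _ = encoder(X)              # S: (32, I, 2*hidden), forward + backward states concatenated
```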

        ④The input of WaveAtt: S^{(k)}=\left(s_{1}^{(k)},\ldots,s_{i}^{(k)},\ldots,s_{I}^{(k)}\right), where k indexes the hidden dimension

        ⑤Through a sliding window of length 2L+1, the subsequence centered on time step i is:

S_{i}^{(k)}=\begin{pmatrix}s_{i-L}^{(k)},\ldots,s_{i-1}^{(k)}, s_{i}^{(k)}, s_{i+1}^{(k)},\ldots,s_{i+L}^{(k)}\end{pmatrix}

        ⑥And they set

\alpha(S_i)=\mathrm{ReLU}\left(w_b+\sum_{l=-L}^Lw_l\cdot s_{i+l}\right)+\epsilon

where \mathbf{w}=\left ( w_{-L},...,w_L,w_b \right )^\top are trainable parameters and \epsilon is a small positive constant that keeps \alpha(S_i) positive

        ⑦Data-dependent wavelet function:

\psi_{S_{i}}(t)=\frac{1}{\sqrt{\alpha\left(S_{i}\right)}}\psi\left(\frac{t-i}{\alpha\left(S_{i}\right)}\right)

compared to the common WT, the fixed \alpha is replaced by the data-dependent \alpha(S_i), and the shift is fixed to \tau =i

        ⑧The attention weights:

\mathrm{ATT}\left(\psi_{S_i}\left(t\right)\right)=\frac{\psi_{S_i}\left(t\right)}{\sum_{\tau=i-L}^{i+L}\left|\psi_{S_i}\left(\tau\right)\right|}

        ⑨The wavelet-attention output at step i, an attention-weighted sum of the BiLSTM hidden states:

r_{i}=\sum_{t=i-L}^{i+L}\mathrm{ATT}\left(\psi_{S_i}(t)\right)\cdot s_{t}

        ⑩To extract diverse frequency components, they introduce \Gamma wavelet families; at time step i, the frequency components are:

R_i=\left ( r_{i,1},...,r_{i,\Gamma } \right )

        ⑪The final output of WaveAtt:

\mathcal{R}=\left ( R_1,...,R_i,...,R_I \right )
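Putting steps ⑤-⑨ together, below is a minimal PyTorch sketch of a single WaveAtt head on one hidden dimension. The Mexican-hat wavelet, the layer names, and the boundary handling (windows near the edges are simply not computed) are my assumptions; the paper's \Gamma wavelet families would correspond to \Gamma such heads with different mother wavelets:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveAttHead(nn.Module):
    """Sketch of one data-dependent wavelet attention head (one dimension k)."""
    def __init__(self, L, eps=1e-4):
        super().__init__()
        self.L = L
        self.alpha_net = nn.Linear(2 * L + 1, 1)  # weights (w_{-L},...,w_L) and bias w_b
        self.eps = eps

    @staticmethod
    def psi(t):  # Mexican-hat mother wavelet (an assumed choice)
        return (1 - t**2) * torch.exp(-t**2 / 2)

    def forward(self, s):                               # s: (batch, I), one dim of S
        windows = s.unfold(1, 2 * self.L + 1, 1)        # S_i windows: (batch, I-2L, 2L+1)
        alpha = F.relu(self.alpha_net(windows)) + self.eps   # alpha(S_i) > 0
        offsets = torch.arange(-self.L, self.L + 1,
                               dtype=s.dtype, device=s.device)  # t - i
        w = self.psi(offsets / alpha) / alpha.sqrt()    # psi_{S_i}(t)
        att = w / w.abs().sum(dim=-1, keepdim=True)     # ATT(psi_{S_i}(t))
        return (att * windows).sum(dim=-1)              # r_i: (batch, I-2L)

r = WaveAttHead(L=5)(torch.randn(32, 100))              # one frequency component per step
```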

2.5. Dynamic Time Warping Attention

2.5.1. Dynamic Time Warping

        ①For two sequences P=\left ( p_1,...,p_m,...,p_M \right ) and Q=\left ( q_1,...,q_n,...,q_N \right ), Dynamic Time Warping (DTW) computes a warping path:

H=\left( \left(p_{m_{1}},q_{n_{1}}\right),\ldots,\left(p_{m_{z}},q_{n_{z}}\right),\ldots,\left(p_{m_{Z}},q_{n_{Z}}\right)\right)

where m_1=n_1=1, m_Z=M, n_Z=N, 0\leq m_{z+1}-m_{z}\leq 1, and 0\leq n_{z+1}-n_{z}\leq 1

        ②The distance between two sequences:

d_H=\sum_{z=1}^Z\|p_{m_z}-q_{n_z}\|_2

where \left \| \cdot \right \|_2 denotes the L2 norm

        ③The optimal path H^* is the one with the shortest distance among all possible warping paths \mathcal{H}:

H^{*}=\arg\min_{H\in\mathcal{H}}d_{H}

where d^*=d_{H^*} is the similarity measure

        ④⭐This non-linear alignment allows pairs with different temporal indexes as long as the order is preserved; for example, \left ( p_1,p_2,p_3,p_4 \right ) can be aligned with \left ( q_1,q_1,q_1,q_2 \right )

        ⑤Furthermore, DTW can align two sequences of unequal length, for instance a sequence and its downsampled version
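For reference, a textbook dynamic-programming implementation of d^* (the classical non-differentiable DTW, not the attention variant the paper builds):

```python
import numpy as np

def dtw_distance(P, Q):
    """O(MN) dynamic programming for d* between two (possibly unequal-length)
    vector sequences, under the step constraints given above."""
    M, N = len(P), len(Q)
    D = np.full((M + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            cost = np.linalg.norm(P[m - 1] - Q[n - 1])        # L2 norm of the pair
            D[m, n] = cost + min(D[m - 1, n],                 # advance m, repeat q_n
                                 D[m, n - 1],                 # advance n, repeat p_m
                                 D[m - 1, n - 1])             # advance both
    return D[M, N]

# A sequence and its 2x-downsampled version still align well
P = np.sin(np.linspace(0, 3 * np.pi, 40))[:, None]
print(dtw_distance(P, P[::2]))
```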

2.5.2. Local Dynamic Time Warping Attentions

        ①A vector sequence V=(v_1,...,v_i,...,v_I) is generated from \mathcal{R} by a task-dependent neural network, where each v_i has dimension K_V

        ②Define a learnable Universal Feature (vector) sequence U with the same dimension as V

         ③The schematic of DTWAtt:

where the window size is L+1, an example subsequence is V_{i}=(v_{i},v_{i+1},\ldots,v_{i+L}), the corresponding U_{i}=(u_{i},u_{i+1},\ldots,u_{i+L}) has the same length, and the distances of all G possible warping paths between the two are:

d\left(V_{i},U_{i}\right)=\left(d_{1}\left(V_{i},U_{i}\right),\ldots,d_{g}\left(V_{i},U_{i}\right),\ldots,d_{G}\left(V_{i},U_{i}\right)\right)

        ④Attention weight of each warping path:

\mathrm{ATT}\left(d_{g}\right)=\frac{\exp\left(-d_{g}\right)}{\sum_{j=1}^{G}\exp\left(-d_{j}\right)}

        ⑤The summed distance:

b=\sum_{g=1}^{G}\mathrm{ATT}\left(d_{g}\right)d_{g}

        ⑥For N learnable parameter sequences \mathcal{U}=(U^{(1)},...,U^{(N)}), the output of the multi-head DTWAtt is C=(b_1,...,b_I)
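A minimal sketch of steps ④-⑤ (the enumeration of all G warping paths is omitted; the point is that softmax(-d) turns the hard arg-min of classical DTW into a differentiable soft minimum, which is what lets DTW live inside a deep network):

```python
import torch

def dtw_attention(d):
    """Given distances d_1..d_G of all warping paths between V_i and U_i,
    return b: an attention-weighted distance biased toward the shortest path."""
    att = torch.softmax(-d, dim=-1)   # ATT(d_g) = exp(-d_g) / sum_j exp(-d_j)
    return (att * d).sum(dim=-1)      # b

d = torch.tensor([2.0, 0.5, 1.2])    # toy distances for G = 3 paths
b = dtw_attention(d)                  # dominated by the shortest path (0.5)
```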

2.5.3. The Model Output Layer

        ①The output layers for TSC and TSF:

where Conv1D,128,3 denotes 128 convolution kernels of kernel size 3
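Read literally, a PyTorch sketch of such a head (the input channel count and the pooling/classifier tail are illustrative assumptions for the TSC case):

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Conv1d(in_channels=64, out_channels=128, kernel_size=3),  # "Conv1D,128,3"
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),   # global average pooling over time
    nn.Flatten(),
    nn.Linear(128, 10),        # e.g. a 10-class TSC output
)
logits = head(torch.randn(32, 64, 100))  # (batch, channels, time) -> (32, 10)
```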

2.6. Experiments

2.6.1. Task I: Time Series Classification

(1)Datasets

        ①Input: multivariate time series X \in \mathbb{R}^{K \times I}

        ②Datasets: 30 datasets from the UEA multivariate time series classification archive

        ③Sample size: 24-50000

        ④Series length: 8-17984

        ⑤Dimension: 2-1345

(2)Baselines

        ①(These eight baselines alone fill a whole page!) The comparison covers a DTW-based algorithm, a pattern-based algorithm, a feature-based algorithm, two ensemble methods, two deep learning models, and a wavelet-based deep learning model

        ②Baselines: DTW_D, WEASEL+MUSE, CMFMTS+RF, LCEM, TapNet, TST, MINIROCKET, mWDN, and two variants of WHEN

(3)Results and Analysis

        ①Comparison table:

        ②Nemenyi test on compared models:

where a rank difference exceeding the CD (critical difference, shown at the top left) indicates a statistically significant difference; blue lines are WHEN and its variants, red lines are deep learning models, and orange lines are traditional baselines

2.6.2. Task II: Time Series Forecasting

(1)Datasets

        ①Temperature: 50-dimensional time series of length 164

        ②AQI: 6-dimensional time series of length 2815

        ③Traffic: 214-dimensional time series of length 4464

(2)Baselines

        ①Compared baselines: ARIMA, FC-LSTM, NRDE, STRIPE++, ESG, GTS, TST, mWDN, and two variants of WHEN

(3)Results and Analysis

        ①Comparison table:

2.6.3. Exploratory Analysis

(1)Data-Dependent Frequencies Extraction

        ①Data-dependent frequencies extracted by different wavelet families:

(2)Warping Distance Comparison in DTW Attention

        ①Visualization of warping distances in the DTW attention:

2.7. Conclusion

        ①A brief concluding summary (nothing new here)

3. Background Notes

3.1. Dynamic Time Warping (DTW)

(1) Definition: DTW measures the similarity between sequences by matching them through non-linear stretching and compression of the time axis. Unlike direct measures such as Euclidean distance, DTW handles non-linear temporal variation, i.e., it can compare sequences that are unsynchronized or misaligned. At its core, DTW uses dynamic programming to find the optimal alignment path between two time series that minimizes their distance.

(2) Further reading: DTW(Dynamic Time Warping)动态时间规整 - 知乎

3.2. Nemenyi test

(1) Further reading: 非参数检验——Wilcoxon 检验 & Friedman 检验与 Nemenyi 后续检验_nemenyi检验-CSDN博客
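(2) For reference, the critical difference (CD) used in the Nemenyi test (a standard result from Demšar's model-comparison methodology, not specific to this paper), for k models compared on N datasets:

\mathrm{CD}=q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}

where q_{\alpha} is the critical value of the Studentized range statistic divided by \sqrt{2}; two models differ significantly when their average ranks differ by more than CD.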

4. Reference

Wang, J. et al. (2023) 'WHEN: A Wavelet-DTW Hybrid Attention Network for Heterogeneous Time Series Analysis', in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '23), pp. 2361-2373.
