[KDD 2023] WHEN: A Wavelet-DTW Hybrid Attention Network for Heterogeneous Time Series Analysis

Paper link: WHEN: A Wavelet-DTW Hybrid Attention Network for Heterogeneous Time Series Analysis | Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

The English here is typed entirely by hand, summarizing and paraphrasing the original paper. Some unavoidable spelling and grammar mistakes may slip in; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with caution.

Contents

1. Thoughts

2. Close Reading of the Paper

2.1. Abstract

2.2. Introduction

2.3. Related Work

2.4. Data-Dependent Wavelet Attention

2.4.1. Wavelet-based Frequency Analysis

2.4.2. Data-dependent Wavelet Attention for Dynamic Frequency Analysis

2.5. Dynamic Time Warping Attention

2.5.1. Dynamic Time Warping

2.5.2. Local Dynamic Time Warping Attentions

2.5.3. The Model Output Layer

2.6. Experiments

2.6.1. Task I: Time Series Classification

2.6.2. Task II: Time Series Forecasting

2.6.3. Exploratory Analysis

2.7. Conclusion

3. Background Notes

3.1. Dynamic Time Warping (DTW)

3.2. Nemenyi test

4. Reference


1. Thoughts

(1) I really dislike authors who write papers in inscrutable calligraphic fonts. First, some of those script letters are simply unrecognizable; second, formula OCR tends to misread them as other letters; and third, CSDN does not even support that kind of script font. Readers just want the analysis, so why be so fancy? Could the notation please stop squiggling around?

(2) Rant aside, it should not reflect on the paper itself. The authors migrated DTW, originally a classical machine learning tool, into deep learning, which is a nice move.

2. Close Reading of the Paper

2.1. Abstract

        ①Challenge in time series analysis: heterogeneity of data

        ②Two types of heterogeneity: intra-sequence nonstationarity (which I abbreviate as ISN) and inter-sequence asynchronism (which I abbreviate as ISA) (the authors always spell out the full names, which are far too long to keep typing by hand)

        ③Solution: the proposed WHEN framework, which contains two attention mechanisms

asynchronism  n. lack of simultaneity; asynchrony

2.2. Introduction

        ①Task types for time series analysis (TSA):  time series classification (TSC) and time series forecasting (TSF)

        ②ISN arises from inherently heterogeneous statistics such as mean, variance, and frequency components; ISA is caused by heterogeneous sampling rates or phase perturbations. Both are traditionally handled by the Dynamic Time Warping (DTW) algorithm, which is not a deep learning method

        ③Taking ECG as an example, intra-sequence nonstationarity and inter-sequence asynchronism look like:

        ④Limitation of CNNs and RNNs on these two heterogeneous characteristics: the structures of deep models are repetitive (the authors argue that identical, repeated convolution kernels cannot capture heterogeneous patterns, though many modern models no longer use plain convolutions anyway)

        ⑤Framework of Wavelet-DTW Hybrid attEntion Networks (WHEN):

2.3. Related Work

(1)Heterogeneous Time Series Analysis

        ①Traditional way: manually transform non-stationary time series into stationary ones

        ②The section lists other deep learning methods

(2)Time Series Classification (TSC)

        ①3 types of methods: distance based methods, feature based methods and ensemble methods

(3)Time Series Forecasting (TSF)

        ①TSF relies on autoregressive correlation

2.4. Data-Dependent Wavelet Attention

2.4.1. Wavelet-based Frequency Analysis

        ①Define the wavelet basis, generated from a mother wavelet \psi by dilation and shifting:

\psi_{\alpha,\tau}(t)=\frac{1}{\sqrt{\alpha}}\psi\left(\frac{t-\tau}{\alpha}\right)

where \alpha \in \mathbb{R}^+ denotes the dilation scale (controlling the frequency band) and \tau \in \mathbb{R} is the shift

        ②For a sequential signal f\left ( t \right ) and basis \psi_{\alpha,\tau}(t), the wavelet transform (WT) is:

r_{\alpha,\tau}=\int_{-\infty}^{+\infty}f(t) \frac{1}{\sqrt{\alpha}}\psi\left(\frac{t-\tau}{\alpha}\right) \mathrm{d}t

where r_{\alpha,\tau} is the intensity of the component in frequency band \frac{1}{\alpha} at location \tau
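To make the transform concrete, here is a minimal NumPy sketch (my own illustration, not the paper's code) that discretizes the integral above; the Mexican-hat mother wavelet and all names are assumptions:

```python
import numpy as np

def mexican_hat(t):
    """One common choice of mother wavelet psi (Mexican hat / Ricker)."""
    return (1 - t**2) * np.exp(-t**2 / 2)

def wavelet_coefficient(f, alpha, tau, dt=1.0):
    """Riemann-sum approximation of r_{alpha,tau}: correlate the sampled
    signal f with the dilated and shifted wavelet psi((t - tau)/alpha)."""
    t = np.arange(len(f)) * dt
    psi = mexican_hat((t - tau) / alpha) / np.sqrt(alpha)
    return np.sum(f * psi) * dt

# Larger alpha -> wider wavelet -> lower frequency band 1/alpha
f = np.sin(2 * np.pi * 0.05 * np.arange(200))
r = wavelet_coefficient(f, alpha=10.0, tau=100.0)
```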

2.4.2. Data-dependent Wavelet Attention for Dynamic Frequency Analysis

        ①They design a data-dependent wavelet attention mechanism (WaveAtt):

        ②The original input: a multivariate time series X=(x_1,x_2,\ldots,x_i,\ldots,x_I)^\top, where x_i \in \mathbb{R}^{K_X} is a K_X-dimensional vector

        ③They encode X with a Bidirectional Long Short-Term Memory (BiLSTM) network into a hidden sequence S=\left ( s_1,...,s_i,...,s_I \right ), capturing both forward and backward temporal dependencies
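A minimal PyTorch sketch of this encoding step (the dimension numbers are illustrative assumptions, not values from the paper):

```python
import torch
import torch.nn as nn

K_X, hidden, I = 8, 64, 100    # input dim K_X, hidden size, series length (illustrative)
encoder = nn.LSTM(input_size=K_X, hidden_size=hidden,
                  batch_first=True, bidirectional=True)

X = torch.randn(32, I, K_X)    # a batch of 32 multivariate time series
S, _ = encoder(X)              # S: (32, I, 2*hidden), forward + backward states concatenated
```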

        ④The input of WaveAtt: S^{(k)}=\left(s_{1}^{(k)},\ldots,s_{i}^{(k)},\ldots,s_{I}^{(k)}\right), where k indexes the hidden dimension

        ⑤Through a sliding window of length 2L+1, the subsequence centered on time step i is:

S_{i}^{(k)}=\begin{pmatrix}s_{i-L}^{(k)},\ldots,s_{i-1}^{(k)}, s_{i}^{(k)}, s_{i+1}^{(k)},\ldots,s_{i+L}^{(k)}\end{pmatrix}

        ⑥And they set

\alpha(S_i)=\mathrm{ReLU}\left(w_b+\sum_{l=-L}^Lw_l\cdot s_{i+l}\right)+\epsilon

where \mathbf{w}=\left ( w_{-L},...,w_L,w_b \right )^\top are trainable parameters and \epsilon is a small positive constant that keeps \alpha(S_i) positive

        ⑦Data-dependent wavelet function:

\psi_{S_{i}}(t)=\frac{1}{\sqrt{\alpha\left(S_{i}\right)}}\psi\left(\frac{t-i}{\alpha\left(S_{i}\right)}\right)

compared to the common WT, the fixed \alpha is replaced by the data-dependent \alpha(S_i), and the shift is fixed to \tau =i

        ⑧The attention weights:

\mathrm{ATT}\left(\psi_{S_i}\left(t\right)\right)=\frac{\psi_{S_i}\left(t\right)}{\sum_{\tau=i-L}^{i+L}\left|\psi_{S_i}\left(\tau\right)\right|}

        ⑨The wavelet-attention output at step i, an attention-weighted sum of the BiLSTM hidden states:

r_{i}=\sum_{t=i-L}^{i+L}\mathrm{ATT}\left(\psi_{S_i}(t)\right)\cdot s_{t}

        ⑩To extract diverse frequency components, they introduce \Gamma wavelet families; at time step i, the frequency components are:

R_i=\left ( r_{i,1},...,r_{i,\Gamma } \right )

        ⑪The final output of WaveAtt:

\mathcal{R}=\left ( R_1,...,R_i,...,R_I \right )
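Putting steps ⑤-⑨ together, below is a minimal PyTorch sketch of a single WaveAtt head on one hidden dimension. The Mexican-hat wavelet, the layer names, and the boundary handling (windows near the edges are simply not computed) are my assumptions; the paper's \Gamma wavelet families would correspond to \Gamma such heads with different mother wavelets:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveAttHead(nn.Module):
    """Sketch of one data-dependent wavelet attention head (one dimension k)."""
    def __init__(self, L, eps=1e-4):
        super().__init__()
        self.L = L
        self.alpha_net = nn.Linear(2 * L + 1, 1)  # weights (w_{-L},...,w_L) and bias w_b
        self.eps = eps

    @staticmethod
    def psi(t):  # Mexican-hat mother wavelet (an assumed choice)
        return (1 - t**2) * torch.exp(-t**2 / 2)

    def forward(self, s):                               # s: (batch, I), one dim of S
        windows = s.unfold(1, 2 * self.L + 1, 1)        # S_i windows: (batch, I-2L, 2L+1)
        alpha = F.relu(self.alpha_net(windows)) + self.eps   # alpha(S_i) > 0
        offsets = torch.arange(-self.L, self.L + 1,
                               dtype=s.dtype, device=s.device)  # t - i
        w = self.psi(offsets / alpha) / alpha.sqrt()    # psi_{S_i}(t)
        att = w / w.abs().sum(dim=-1, keepdim=True)     # ATT(psi_{S_i}(t))
        return (att * windows).sum(dim=-1)              # r_i: (batch, I-2L)

r = WaveAttHead(L=5)(torch.randn(32, 100))              # one frequency component per step
```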

2.5. Dynamic Time Warping Attention

2.5.1. Dynamic Time Warping

        ①For two sequences P=\left ( p_1,...,p_m,...,p_M \right ) and Q=\left ( q_1,...,q_n,...,q_N \right ), Dynamic Time Warping (DTW) computes a warping path:

H=\left( \left(p_{m_{1}},q_{n_{1}}\right),\ldots,\left(p_{m_{z}},q_{n_{z}}\right),\ldots,\left(p_{m_{Z}},q_{n_{Z}}\right)\right)

where m_1=n_1=1, m_Z=M, n_Z=N, 0\leq m_{z+1}-m_{z}\leq 1, and 0\leq n_{z+1}-n_{z}\leq 1

        ②The distance between two sequences:

d_H=\sum_{z=1}^Z\|p_{m_z}-q_{n_z}\|_2

where \left \| \cdot \right \|_2 denotes the L2 norm

        ③The optimal path H^* is the one with the shortest distance among all possible warping paths \mathcal{H}:

H^{*}=\arg\min_{H\in\mathcal{H}}d_{H}

where d^*=d_{H^*} is the similarity measure

        ④⭐This non-linear alignment allows pairs with different temporal indexes as long as the order is preserved; for example, \left ( p_1,p_2,p_3,p_4 \right ) can be aligned with \left ( q_1,q_1,q_1,q_2 \right )

        ⑤Furthermore, DTW can align two sequences of unequal length, for instance a sequence and its downsampled version
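For reference, a textbook dynamic-programming implementation of d^* (the classical non-differentiable DTW, not the attention variant the paper builds):

```python
import numpy as np

def dtw_distance(P, Q):
    """O(MN) dynamic programming for d* between two (possibly unequal-length)
    vector sequences, under the step constraints given above."""
    M, N = len(P), len(Q)
    D = np.full((M + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            cost = np.linalg.norm(P[m - 1] - Q[n - 1])        # L2 norm of the pair
            D[m, n] = cost + min(D[m - 1, n],                 # advance m, repeat q_n
                                 D[m, n - 1],                 # advance n, repeat p_m
                                 D[m - 1, n - 1])             # advance both
    return D[M, N]

# A sequence and its 2x-downsampled version still align well
P = np.sin(np.linspace(0, 3 * np.pi, 40))[:, None]
print(dtw_distance(P, P[::2]))
```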

2.5.2. Local Dynamic Time Warping Attentions

        ①A vector sequence V=(v_1,...,v_i,...,v_I) is generated from \mathcal{R} by a task-dependent neural network, where each v_i has dimension K_V

        ②Define a learnable Universal Feature (vector) sequence U with the same dimension as V

         ③The schematic of DTWAtt:

where the window size is L+1, an example subsequence is V_{i}=(v_{i},v_{i+1},\ldots,v_{i+L}), the corresponding U_{i}=(u_{i},u_{i+1},\ldots,u_{i+L}) has the same length, and the distances of all G possible warping paths between the two are:

d\left(V_{i},U_{i}\right)=\left(d_{1}\left(V_{i},U_{i}\right),\ldots,d_{g}\left(V_{i},U_{i}\right),\ldots,d_{G}\left(V_{i},U_{i}\right)\right)

        ④Attention weight of each warping path:

\mathrm{ATT}\left(d_{g}\right)=\frac{\exp\left(-d_{g}\right)}{\sum_{j=1}^{G}\exp\left(-d_{j}\right)}

        ⑤The summed distance:

b=\sum_{g=1}^{G}\mathrm{ATT}\left(d_{g}\right)d_{g}

        ⑥For N learnable parameter sequences \mathcal{U}=(U^{(1)},...,U^{(N)}), the output of the multi-head DTWAtt is C=(b_1,...,b_I)
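A minimal sketch of steps ④-⑤ (the enumeration of all G warping paths is omitted; the point is that softmax(-d) turns the hard arg-min of classical DTW into a differentiable soft minimum, which is what lets DTW live inside a deep network):

```python
import torch

def dtw_attention(d):
    """Given distances d_1..d_G of all warping paths between V_i and U_i,
    return b: an attention-weighted distance biased toward the shortest path."""
    att = torch.softmax(-d, dim=-1)   # ATT(d_g) = exp(-d_g) / sum_j exp(-d_j)
    return (att * d).sum(dim=-1)      # b

d = torch.tensor([2.0, 0.5, 1.2])    # toy distances for G = 3 paths
b = dtw_attention(d)                  # dominated by the shortest path (0.5)
```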

2.5.3. The Model Output Layer

        ①The output layers for TSC and TSF:

where Conv1D,128,3 denotes 128 convolution kernels of kernel size 3
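Read literally, a PyTorch sketch of such a head (the input channel count and the pooling/classifier tail are illustrative assumptions for the TSC case):

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Conv1d(in_channels=64, out_channels=128, kernel_size=3),  # "Conv1D,128,3"
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),   # global average pooling over time
    nn.Flatten(),
    nn.Linear(128, 10),        # e.g. a 10-class TSC output
)
logits = head(torch.randn(32, 64, 100))  # (batch, channels, time) -> (32, 10)
```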

2.6. Experiments

2.6.1. Task I: Time Series Classification

(1)Datasets

        ①Input: multivariate time series X \in \mathbb{R}^{K \times I}

        ②Datasets: 30 datasets from the UEA multivariate time series classification archive

        ③Sample size: 24-50000

        ④Series length: 8-17984

        ⑤Dimension: 2-1345

(2)Baselines

        ①(These eight baselines alone fill a whole page!) The comparison covers a DTW-based algorithm, a pattern-based algorithm, a feature-based algorithm, two ensemble methods, two deep learning models, and a wavelet-based deep learning model

        ②Baselines: DTW_D, WEASEL+MUSE, CMFMTS+RF, LCEM, TapNet, TST, MINIROCKET, mWDN, and two variants of WHEN

(3)Results and Analysis

        ①Comparison table:

        ②Nemenyi test on compared models:

where a rank difference exceeding the CD (critical difference, shown at the top left) indicates a statistically significant difference; blue lines are WHEN and its variants, red lines are deep learning models, and orange lines are traditional baselines

2.6.2. Task II: Time Series Forecasting

(1)Datasets

        ①Temperature: 50-dimensional time series of length 164

        ②AQI: 6-dimensional time series of length 2815

        ③Traffic: 214-dimensional time series of length 4464

(2)Baselines

        ①Compared baselines: ARIMA, FC-LSTM, NRDE, STRIPE++, ESG, GTS, TST, mWDN, and two variants of WHEN

(3)Results and Analysis

        ①Comparison table:

2.6.3. Exploratory Analysis

(1)Data-Dependent Frequencies Extraction

        ①Data-dependent frequencies extracted by different wavelet families:

(2)Warping Distance Comparison in DTW Attention

        ①Visualization of warping distances in the DTW attention:

2.7. Conclusion

        ①A brief concluding summary (nothing new here)

3. Background Notes

3.1. Dynamic Time Warping (DTW)

(1) Definition: DTW measures the similarity between sequences by matching them through non-linear stretching and compression of the time axis. Unlike direct measures such as Euclidean distance, DTW handles non-linear temporal variation, i.e., it can compare sequences that are unsynchronized or misaligned. At its core, DTW uses dynamic programming to find the optimal alignment path between two time series that minimizes their distance.

(2) Further reading: DTW(Dynamic Time Warping)动态时间规整 - 知乎

3.2. Nemenyi test

(1) Further reading: 非参数检验——Wilcoxon 检验 & Friedman 检验与 Nemenyi 后续检验_nemenyi检验-CSDN博客
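(2) For reference, the critical difference (CD) used in the Nemenyi test (a standard result from Demšar's model-comparison methodology, not specific to this paper), for k models compared on N datasets:

\mathrm{CD}=q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}

where q_{\alpha} is the critical value of the Studentized range statistic divided by \sqrt{2}; two models differ significantly when their average ranks differ by more than CD.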

4. Reference

Wang, J. et al. (2023) 'WHEN: A Wavelet-DTW Hybrid Attention Network for Heterogeneous Time Series Analysis', in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '23), pp. 2361-2373.
