Symbolic Aggregate approXimation.



Introduction

In short, the Symbolic Aggregate approXimation (SAX) algorithm transforms an input time series into a string.

The algorithm was proposed by Lin et al. and extends the PAA-based approach, inheriting the original algorithm's simplicity and low computational complexity while providing satisfactory sensitivity and selectivity in range query processing. Moreover, the use of a symbolic representation opens the door to the existing wealth of data structures and string-manipulation algorithms in computer science, such as hashing, regular expressions, pattern matching, suffix trees, and grammatical inference.

The algorithm

SAX transforms a time series X of length n into a string of arbitrary length $\omega$, where typically $\omega \ll n$, using an alphabet A of size a > 2. The algorithm consists of two steps: (i) it transforms the original time series into the PAA representation and (ii) it converts the PAA data into a string.

The use of PAA brings the advantages of simple and efficient dimensionality reduction while providing the important lower-bounding property. The actual conversion of PAA coefficients into letters using a lookup table is also computationally efficient, and the contractive property of the symbolic distance was proven by Lin et al.

Discretization of the PAA representation of a time series into SAX is implemented so that the resulting symbols occur with equal probability. The extensive and rigorous analysis of various time-series datasets available to the original algorithm's authors has shown that the values of z-normalized time series follow the Normal distribution. Using its properties, it is easy to pick $a$ equal-sized areas under the Normal curve, using lookup tables for the coordinates of the cut lines that slice the area under the Gaussian curve.

The x coordinates of these lines are called breakpoints or cuts in the SAX context. The list of breakpoints $B = \beta_{1}, \beta_{2}, \ldots, \beta_{a-1}$ such that $\beta_{i-1}<\beta_{i}$ and $\beta_{0}=-\infty$, $\beta_{a}=\infty$ divides the area under N(0,1) into $a$ equal areas. By assigning a corresponding alphabet symbol $\alpha_{j}$ to each interval $[\beta_{j-1},\beta_{j})$, the conversion of the vector of PAA coefficients $\bar{C}$ into the string $\hat{C}$ is implemented as follows:

$$\hat{c}_{i} = \alpha_{j} \quad \text{iff} \quad \bar{c}_{i} \in [\beta_{j-1}, \beta_{j})$$
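The rule above can be sketched in a few lines of R. This is a minimal illustration, not code from the original authors: the helper names `sax_breakpoints` and `paa_to_letters` are made up for this example, and the breakpoints are computed with `qnorm` instead of being read from a lookup table.

```r
# Breakpoints beta_1 .. beta_{a-1}: the N(0,1) quantiles that split the
# area under the curve into `a` equal parts (hypothetical helper name).
sax_breakpoints <- function(a) {
  qnorm(seq_len(a - 1) / a)
}

# Map PAA coefficients to letters: a value in [beta_{j-1}, beta_j)
# receives the j-th letter of the alphabet (hypothetical helper name).
paa_to_letters <- function(paa_values, a) {
  beta <- sax_breakpoints(a)
  # findInterval() counts the breakpoints <= value, so adding 1 gives j
  j <- findInterval(paa_values, beta) + 1
  paste(letters[j], collapse = "")
}

round(sax_breakpoints(4), 2)                 # -0.67  0.00  0.67
paa_to_letters(c(-1.2, -0.3, 0.2, 0.9), 4)   # "abcd"
```

Because `findInterval` treats intervals as closed on the left, a coefficient that falls exactly on a breakpoint goes to the upper letter, which matches the half-open intervals $[\beta_{j-1},\beta_{j})$.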

SAX introduces new metrics for measuring the distance between strings by extending the Euclidean and PAA distances. The function returning the minimal distance between the string representations $\hat{Q}$ and $\hat{C}$ of two original time series is defined as

$$MINDIST(\hat{Q},\hat{C}) \equiv \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left(dist(\hat{q}_{i},\hat{c}_{i})\right)^{2}}$$

where $w$ is the length of the strings, the dist function is implemented using a lookup table for the particular set of breakpoints (alphabet size), as shown in the table below, and the value of each cell (r,c) is computed as

$$cell_{(r,c)} = \begin{cases} 0, & \text{if } |r-c| \leq 1 \\ \beta_{\max(r,c)-1} - \beta_{\min(r,c)}, & \text{otherwise} \end{cases}$$

The lookup table for a 4-letter alphabet

       a       b       c       d
a      0       0       0.67    1.34
b      0       0       0       0.67
c      0.67    0       0       0
d      1.34    0.67    0       0
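As a rough sketch of how such a table can be generated, the R function below fills each cell with 0 for identical or adjacent letters and with the gap between the enclosing breakpoints otherwise. The name `sax_dist_table` is invented for this example, and the breakpoints come from `qnorm` rather than from a stored table.

```r
# Build the symbol-distance lookup table for an alphabet of size a
# (hypothetical helper, not part of any SAX package).
sax_dist_table <- function(a) {
  # Pad with beta_0 = -Inf and beta_a = Inf, so beta_k is stored at b[k + 1]
  b <- c(-Inf, qnorm(seq_len(a - 1) / a), Inf)
  cell <- function(r, s) {
    # 0 for identical or adjacent letters, else beta_{max-1} - beta_{min}
    ifelse(abs(r - s) <= 1, 0, b[pmax(r, s)] - b[pmin(r, s) + 1])
  }
  tab <- outer(seq_len(a), seq_len(a), cell)
  dimnames(tab) <- list(letters[seq_len(a)], letters[seq_len(a)])
  tab
}

round(sax_dist_table(4), 2)
```

With exact breakpoints the a-d cell is 2 * 0.6745 and rounds to 1.35. The table above shows 1.34 because it works with the breakpoint already cut to 0.67.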

As shown by Lin et al., this SAX distance metric lower-bounds the PAA distance, i.e.

$$\sum_{i=1}^{n}(q_{i}-c_{i})^{2} \geq n(\bar{Q}-\bar{C})^{2} \geq n\left(dist(\hat{Q},\hat{C})\right)^{2}$$

The SAX lower bound was examined by Ding et al. in great detail and found to be superior in precision to the spectral decomposition methods on bursty (non-periodic) datasets.


Piecewise Aggregate Approximation (PAA)

PAA approximates a time series X of length n by a vector $\bar{X}=(\bar{x}_{1},\ldots,\bar{x}_{M})$ of arbitrary length $M \leq n$, where each $\bar{x}_{i}$ is calculated as follows:

$$\bar{x}_{i} = \frac{M}{n} \sum_{j=\frac{n}{M}(i-1)+1}^{\frac{n}{M}i} x_{j}$$

This simply means that in order to reduce the dimensionality from n to M, we first divide the original time series into M equally sized frames and then compute the mean value of each frame. The sequence assembled from the mean values is the PAA approximation (i.e., transform) of the original time series. As shown by Keogh et al., the complexity of the PAA transform can be reduced from O(NM) to O(Mm), where m is the number of frames. By using the following distance measure

$$D_{PAA}(\bar{X},\bar{Y}) \equiv \sqrt{\frac{n}{M}} \sqrt{\sum_{i=1}^{M}(\bar{x}_{i}-\bar{y}_{i})^{2}}$$

Yi & Faloutsos, and Keogh et al., have shown that PAA satisfies the lower-bounding condition and guarantees no false dismissals, i.e.:

$$D_{PAA}(\bar{X},\bar{Y}) \leq D(X,Y)$$
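The lower-bounding inequality can be checked numerically. The sketch below uses the two 15-point series that appear in the SAX primer section of this article. The helpers `paa_means` and `d_paa` are illustrative names only, and `paa_means` assumes that M divides the series length.

```r
x <- c(2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00, 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34)
y <- c(0.50, 1.29, 2.58, 3.83, 3.25, 4.25, 3.83, 5.63, 6.44, 6.25, 8.75, 8.83, 3.25, 0.75, 0.72)

# Frame means; assumes M divides length(v) so every frame has n/M points
paa_means <- function(v, M) {
  colMeans(matrix(v, nrow = length(v) %/% M))
}

# PAA distance: sqrt(n/M) times the Euclidean distance between frame means
d_paa <- function(x, y, M) {
  n <- length(x)
  sqrt(n / M) * sqrt(sum((paa_means(x, M) - paa_means(y, M))^2))
}

d_euclid <- sqrt(sum((x - y)^2))   # distance between the raw series

d_paa(x, y, 3) <= d_euclid   # TRUE
d_paa(x, y, 5) <= d_euclid   # TRUE
```

With M equal to n every frame holds a single point, so the PAA distance coincides with the Euclidean one. Coarser approximations can only shrink the distance, which is why no true match is dismissed when the PAA distance is used as a filter.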

Example

In this primer I use the following time series:

series1 <- c(2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00, 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34)

and the following R code:

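paa <- function(ts, paa_size){
  len <- length(ts)
  if (len == paa_size) {
    # nothing to reduce
    ts
  }
  else {
    if (len %% paa_size == 0) {
      # frames have equal integer length: one matrix column per frame
      colMeans(matrix(ts, nrow=len %/% paa_size, byrow=F))
    }
    else {
      # frame borders cut through points: treat every point as paa_size
      # equal slices, so that each frame receives exactly len slices
      res <- rep.int(0, paa_size)
      for (i in c(0:(len * paa_size - 1))) {
        idx <- i %/% len + 1       # frame that receives this slice
        pos <- i %/% paa_size + 1  # point of ts the slice comes from
        res[idx] <- res[idx] + ts[pos]
      }
      for (i in c(1:paa_size)) {
        res[i] <- res[i] / len
      }
      res
    }
  }
}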

whose application produces a seven-point piecewise aggregate approximation:

s1_paa = paa(series1,7)
(2.23, 5.62, 8.67, 6.36, 4.58, 3.33, 1.45)

or a 9-point approximation which is a bit trickier:

s1_paa = paa(series1,9)
(2.14, 3.63, 8.26, 8.28, 6.27, 4.65, 4.45, 2.39, 1.38)


SAX primer

1.0 Timeseries data

I will use the following time series for this example (the Euclidean distance between ts1 and ts2 is 11.4):

> ts1=c(2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00, 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34)
> ts2=c(0.50, 1.29, 2.58, 3.83, 3.25, 4.25, 3.83, 5.63, 6.44, 6.25, 8.75, 8.83, 3.25, 0.75, 0.72)
> dist(rbind(ts1,ts2), method = "euclidean")
         ts1
ts2 11.42126

I'll transform both series into strings of length 9 whose letters come from an alphabet of size 4.


2.0 Z-normalization

Before transforming the time series with SAX, we z-normalize the data:

znorm <- function(ts){
  ts.mean <- mean(ts)
  ts.dev <- sd(ts)
  (ts - ts.mean)/ts.dev
}

ts1_znorm=znorm(ts1)
ts2_znorm=znorm(ts2)

3.0 PAA transform

PAA follows the standard procedure, using the same paa function as in the Example section:

paa_size=9
s1_paa = paa(ts1_znorm,paa_size)
s2_paa = paa(ts2_znorm,paa_size)

4.0 PAA values to letters

I use the 4-symbol alphabet {a,b,c,d} as in the table above. The cut lines for this alphabet are shown as thin blue lines on the plot below.

SAX transform of ts1 into string through 9-points PAA: “abddccbaa”

SAX transform of ts2 into string through 9-points PAA: “abbccddba”

SAX distance: 0 + 0 + 0.67 + 0 + 0 + 0 + 0.67 + 0 + 0 = 1.34
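The whole primer can be reproduced with the short R sketch below. It is an illustration under a few assumptions: `to_sax` and `symbol_dist` are names invented here, the breakpoints are computed with `qnorm` instead of a lookup table, and `paa` is a compact reformulation of the loop-based function shown earlier (every point is repeated `m` times and consecutive runs of `length(x)` values are averaged).

```r
ts1 <- c(2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00, 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34)
ts2 <- c(0.50, 1.29, 2.58, 3.83, 3.25, 4.25, 3.83, 5.63, 6.44, 6.25, 8.75, 8.83, 3.25, 0.75, 0.72)

znorm <- function(x) (x - mean(x)) / sd(x)

# Compact PAA, equivalent to the loop-based version for m <= length(x)
paa <- function(x, m) colMeans(matrix(rep(x, each = m), nrow = length(x)))

# z-normalize, reduce to w frames, map each frame mean to a letter
to_sax <- function(x, w, a) {
  beta <- qnorm(seq_len(a - 1) / a)
  letters[findInterval(paa(znorm(x), w), beta) + 1]
}

# Per-position symbol distance, following the cell formula of the table
symbol_dist <- function(p, q, a) {
  b <- c(-Inf, qnorm(seq_len(a - 1) / a), Inf)  # beta_k sits at b[k + 1]
  i <- match(p, letters)
  j <- match(q, letters)
  ifelse(abs(i - j) <= 1, 0, b[pmax(i, j)] - b[pmin(i, j) + 1])
}

s1 <- to_sax(ts1, 9, 4)
s2 <- to_sax(ts2, 9, 4)
paste(s1, collapse = "")   # "abddccbaa"
paste(s2, collapse = "")   # "abbccddba"

d <- symbol_dist(s1, s2, 4)
sum(d)                                  # plain sum of symbol distances
sqrt(length(ts1) / 9) * sqrt(sum(d^2))  # MINDIST as defined earlier
```

The plain sum comes out at about 1.35 with exact breakpoints, against 1.34 above where the table value 0.67 is used. MINDIST squares the per-symbol distances and scales by the compression ratio, so it yields a different number (about 1.23 here) that still lies far below the Euclidean distance of 11.42.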

In the plot, orange marks the symbol pairs whose distance is counted: they are not "adjacent" to each other in the table.

SAX symbolization sample source code

--------------------
timeseries2symbol.m:
--------------------

This function takes in a time series and converts it to string(s). There are two options:

1. Convert the entire time series to ONE string
2. Use sliding windows, extract the subsequences and convert these subsequences to strings

For the first option, simply enter the length of the time series as "N".
ex. We have a time series of length 32 and we want to convert it to an 8-symbol string, with alphabet size 3:

timeseries2symbol(data, 32, 8, 3)

For the second option, enter the desired sliding window length as "N".
ex. We have a time series of length 32 and we want to extract subsequences of length 16 using sliding windows, and convert the subsequences to 8-symbol strings, with alphabet size 3:

timeseries2symbol(data, 16, 8, 3)

Input:

data is the raw time series.
N is the length of the sliding window (use the length of the raw time series instead if you don't want to have sliding windows).
n is the number of symbols in the low dimensional approximation of the subsequence.
alphabet_size is the number of discrete symbols. 2 <= alphabet_size

>> mindist_demo

sax_version_of_A =
     3     4     2     1     1     3     4     2

sax_version_of_B =
     1     1     3     4     3     1     1     4

euclidean_distance_A_and_B =
   10.9094

ans =
    5.3600   ---> This is the mindist

-----------------
symbolic_visual.m
-----------------

This demo presents a visual comparison between SAX and PAA and shows how SAX can represent data in finer granularity while using the same, if not less, amount of space as PAA. The input parameter [data] is optional. The default number of PAA segments is 16, and the alphabet size is 4.

--------
Examples:
--------

You can type this up in your matlab:

Recall that there are two options for timeseries2symbol. The first option is demonstrated in sax_demo.m. Now here is an example of the latter. We are going to convert a time series of length 50, with a sliding window of 32, into 8 symbols, with an alphabet size of 3.

>> [symbolic_data, pointers] = timeseries2symbol(long_time_series,32,8,alphabet_size)

symbolic_data =
     1     1     3     3     3     3     1     1
     1     2     3     3     3     2     1     1
     1     3     3     3     3     1     1     1
     2     3     3     3     2     1     1     1
     3     3     3     3     1     1     1     1
     3     3     3     2     1     1     1     2
     3     3     3     1     1     1     1     3
     3     3     2     1     1     1     2     3
     3     3     1     1     1     1     3     3
     3     2     1     1     1     2     3     3

pointers =
     1
     2
     5
     6
     9
    10
    13
    14
    17
    18

Note that each row corresponds to a subsequence (with overlap). The SAX words at 3 and 4 were omitted, since they were the same as the word at 2; the same goes for 7 and 8, which were the same as 6, etc. (look at the pointers).

It might be helpful to view the data this way:

>> [pointers symbolic_data ]

ans =
     1     1     1     3     3     3     3     1     1
     2     1     2     3     3     3     2     1     1
     5     1     3     3     3     3     1     1     1
     6     2     3     3     3     2     1     1     1
     9     3     3     3     3     1     1     1     1
    10     3     3     3     2     1     1     1     2
    13     3     3     3     1     1     1     1     3
    14     3     3     2     1     1     1     2     3
    17     3     3     1     1     1     1     3     3
    18     3     2     1     1     1     2     3     3

So the first word is (1 1 3 3 3 3 1 1), the 9th word is (3 3 3 3 1 1 1 1), and the 14th word is (3 3 2 1 1 1 2 3).
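For readers following along in R, here is a rough sketch of the sliding-window option with the same "skip repeated words" behaviour. It is not a port of timeseries2symbol.m: the function name `sliding_sax` is invented, the breakpoints come from `qnorm`, and flat windows (zero standard deviation) are mapped to all zeros before discretization, which is an assumption of this example.

```r
# Sliding-window SAX with removal of consecutive duplicate words.
# x: raw series, N: window length, n: word length, a: alphabet size.
sliding_sax <- function(x, N, n, a) {
  beta <- qnorm(seq_len(a - 1) / a)
  starts <- seq_len(length(x) - N + 1)
  words <- do.call(rbind, lapply(starts, function(s) {
    w <- x[s:(s + N - 1)]
    sdev <- sd(w)
    # z-normalize the window; a flat window becomes all zeros (assumption)
    z <- if (sdev > 0) (w - mean(w)) / sdev else rep(0, N)
    # PAA: repeat each point n times, average runs of N values
    frame_means <- colMeans(matrix(rep(z, each = n), nrow = N))
    findInterval(frame_means, beta) + 1   # symbols as integers 1..a
  }))
  # keep the first word and every word that differs from its predecessor
  changed <- rowSums(words[-1, , drop = FALSE] !=
                     words[-nrow(words), , drop = FALSE]) > 0
  keep <- c(TRUE, changed)
  list(symbolic_data = words[keep, , drop = FALSE], pointers = starts[keep])
}

long_time_series <- sin(seq(0, 4 * pi, length.out = 50))  # made-up data
res <- sliding_sax(long_time_series, 32, 8, 3)
cbind(res$pointers, res$symbolic_data)
```

As in the MATLAB output, each kept row is paired with the start index of its window, and gaps in the pointer column show where identical consecutive words were dropped.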